fireking77 opened a new issue #4340:
URL: https://github.com/apache/hudi/issues/4340
Hi Guys!
I use HUDI with the following setup:
AWS EMR 6.4 HUDI 0.8, PySpark 3.1.2
The problem is when you try to read from a Hudi dataset incrementally:
```python
df = (spark.read
      .format('org.apache.hudi')
      .options(**{
          "hoodie.datasource.query.type": "incremental",
          "hoodie.datasource.read.begin.instanttime": <start_instant_time>,
          "hoodie.datasource.read.end.instanttime": <stop_instant_time>,
      })
      .load(base_path_ignition + partition_pattern_telemetry))
```
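One way to sidestep the error on the caller side is to check up front whether any commit instants actually fall inside the requested window before issuing the read. Hudi instant times are fixed-width `yyyyMMddHHmmss` timestamp strings, so plain lexicographic comparison matches time order; the helper below is a hypothetical sketch (not part of Hudi's API) that filters a list of commit times with the incremental semantics of begin-exclusive, end-inclusive:

```python
def commits_in_range(commit_times, begin, end):
    """Return the commit instant times that an incremental query over
    (begin, end] would pick up. Hudi instant times are fixed-width
    timestamp strings, so string comparison matches chronological order."""
    return sorted(t for t in commit_times if begin < t <= end)

# Only run the incremental read when the window is non-empty; the list of
# commit times could be gathered e.g. by listing the table's .hoodie folder.
```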
it fails when the requested time range contains no commits:
```
Traceback (most recent call last):
  File "/tmp/pycharm_project_151/aggregates_by_ignition/ignition_aggregate/ignition-aggregate.py", line 510, in <module>
    spark_dag()
  File "/tmp/pycharm_project_151/aggregates_by_ignition/ignition_aggregate/ignition-aggregate.py", line 95, in spark_dag
    .load(base_path_ignition + partition_pattern_telemetry)
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o221.load.
: java.util.NoSuchElementException: No value present in Option
	at org.apache.hudi.common.util.Option.get(Option.java:88)
	at org.apache.hudi.MergeOnReadIncrementalRelation.buildFileIndex(MergeOnReadIncrementalRelation.scala:173)
	at org.apache.hudi.MergeOnReadIncrementalRelation.<init>(MergeOnReadIncrementalRelation.scala:79)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:109)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:63)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Process finished with exit code 1
```
Of course it can be handled on the caller side, but it would be better either to return an empty DataFrame with 0 partitions (so that `new_ignition_inc_df.rdd.getNumPartitions() == 0`), or to raise a meaningful exception instead of the bare `java.util.NoSuchElementException: No value present in Option`.
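Until Hudi itself returns an empty DataFrame, a stopgap is to catch the failure and substitute an empty result. This is only a sketch, assuming the only thing the caller can match on is the "No value present in Option" message that py4j surfaces; `load_fn` and `empty_fn` are hypothetical callables standing in for the actual `spark.read...load(...)` call and something like `spark.createDataFrame([], schema)`:

```python
def read_incremental_or_empty(load_fn, empty_fn):
    """Run `load_fn`; if it fails with the known 'No value present in
    Option' error (raised when the incremental window has no commits),
    fall back to `empty_fn()`. Any other error is re-raised unchanged."""
    try:
        return load_fn()
    except Exception as exc:  # py4j wraps the Java exception; match on text
        if "No value present in Option" in str(exc):
            return empty_fn()
        raise
```

Matching on the exception message is brittle, which is exactly why a dedicated exception type (or an empty DataFrame) from Hudi would be preferable.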
Thanks,
Darvi