afeldman1 commented on a change in pull request #1761:
URL: https://github.com/apache/hudi/pull/1761#discussion_r448009565
##########
File path: docs/_docs/2_3_querying_data.md
##########
@@ -136,6 +136,16 @@ The Spark Datasource API is a popular way of authoring Spark ETL pipelines. Hudi
datasources work (e.g: `spark.read.parquet`). Both snapshot querying and incremental querying are supported here. Typically, Spark jobs require adding `--jars <path to jar>/hudi-spark-bundle_2.11-<hudi version>.jar` to the classpath of drivers and executors. Alternatively, hudi-spark-bundle can also be fetched via the `--packages` option (e.g: `--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3`).
+### Snapshot query {#spark-snap-query}
+This method retrieves the table's data as of the present point in time.
+
+```scala
+val hudiSnapshotQueryDF = spark
+  .read
+  .format("org.apache.hudi")
+  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
+  .load(tablePath + "/*") // Include "/*" at the end of the path if the table is partitioned
+```
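Concretely, the `--packages` route described in the hunk above amounts to a launch command like the following sketch (the bundle coordinates come from the text; the Kryo serializer setting is an assumption based on common Hudi setup guidance, not part of this diff):

```shell
# Launch spark-shell pulling the Hudi bundle from Maven Central,
# instead of shipping a local jar via --jars.
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```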
Review comment:
This is a separate point from the documentation, but ideally wouldn't it be better for Hudi to figure out which sub-directories need to be read for the partitions, instead of expecting callers to pass a glob matching the depth of the partition tree?
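To make the asymmetry concrete: with the documented approach, the caller must already know the partition depth in order to build the glob passed to `.load`. A minimal sketch of what the caller currently encodes by hand (the `globForDepth` helper and the path are purely illustrative, not a Hudi API):

```scala
// Hypothetical helper: build the glob suffix whose depth matches the
// partition tree, which the caller must currently supply themselves.
def globForDepth(depth: Int): String =
  List.fill(depth)("*").mkString("/", "/", "")

// A table partitioned by year/month/day (depth 3) needs "/*/*/*":
val tableBasePath = "/data/hudi/trips" // illustrative base path
val path = tableBasePath + globForDepth(3)
assert(path == "/data/hudi/trips/*/*/*")
```

The reviewer's suggestion is that Hudi itself could discover the partition sub-directories under the base path, making this depth-aware glob unnecessary.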
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]