[
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090098#comment-17090098
]
Udit Mehrotra edited comment on HUDI-829 at 4/22/20, 11:59 PM:
---------------------------------------------------------------
You may also want to look at my implementation of a custom relation in Spark to
read bootstrapped tables:
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
. Here I am building my own file index using Spark's InMemoryFileIndex, but the
filtering part is just one operation now, because once I have all the files, the
Hudi filesystem view is created just once to get the latest files. It's still
work in progress and I have yet to see how fast this is going to be. We can
consider moving to a place where our reading in Spark happens through our own
relations, and underneath we use the native readers.
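To make the "one filtering operation" idea concrete, here is a toy sketch of the single-pass latest-file selection described above. It uses plain Python collections as hypothetical stand-ins for the real Hudi FileSystemView and Spark's InMemoryFileIndex: all files are listed up front, then one grouping pass keeps the newest base file per (partition, file group). The `BaseFile` type and `latest_base_files` helper are illustrative names, not Hudi APIs.

```python
from collections import namedtuple

# Hypothetical stand-in for a Hudi base file: partition path, file group id,
# and the commit (instant) time that wrote it.
BaseFile = namedtuple("BaseFile", ["partition", "file_group", "commit_time"])

def latest_base_files(all_files):
    """Given every file listed up front (as an InMemoryFileIndex would
    provide), keep only the newest file per (partition, file group) --
    a single grouping pass instead of repeated per-partition lookups."""
    latest = {}
    for f in all_files:
        key = (f.partition, f.file_group)
        if key not in latest or f.commit_time > latest[key].commit_time:
            latest[key] = f
    return list(latest.values())

files = [
    BaseFile("2020/04/20", "fg-1", "001"),
    BaseFile("2020/04/20", "fg-1", "002"),  # later commit supersedes "001"
    BaseFile("2020/04/21", "fg-2", "001"),
]
result = latest_base_files(files)
```

The point of the sketch is the shape of the computation: file listing happens once, and the latest-slice filtering is a single pass over that listing rather than a per-query or per-partition activity.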
> Efficiently reading hudi tables through spark-shell
> ---------------------------------------------------
>
> Key: HUDI-829
> URL: https://issues.apache.org/jira/browse/HUDI-829
> Project: Apache Hudi (incubating)
> Issue Type: Task
> Components: Spark Integration
> Reporter: Nishith Agarwal
> Assignee: Nishith Agarwal
> Priority: Major
>
> [~uditme] Created this ticket to track some discussion on the read/query path
> of Spark with Hudi tables.
> My understanding is that when you read Hudi tables through spark-shell, some
> of your queries are slower due to some sequential activity performed by Spark
> when interacting with Hudi tables (even with
> spark.sql.hive.convertMetastoreParquet, which can give you the same data
> reading speed and all the vectorization benefits). Is this slowness observed
> during Spark query planning? Can you please elaborate on this?
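For readers unfamiliar with the setting mentioned above: `spark.sql.hive.convertMetastoreParquet` is a real Spark SQL configuration that, when enabled, makes Spark use its own built-in Parquet reader (with vectorization) instead of the Hive SerDe for metastore Parquet tables. A minimal configuration sketch, assuming a PySpark environment is available:

```python
from pyspark.sql import SparkSession

# Enable Spark's native vectorized Parquet reader for Hive metastore tables
# instead of the slower Hive SerDe path.
spark = (SparkSession.builder
         .config("spark.sql.hive.convertMetastoreParquet", "true")
         .enableHiveSupport()
         .getOrCreate())
```

This is only a configuration fragment; as the discussion above notes, enabling it does not by itself remove the sequential file-listing work Spark performs when planning queries over Hudi tables.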
--
This message was sent by Atlassian Jira
(v8.3.4#803005)