emaynardigs commented on issue #27155: [SPARK-17636][SPARK-25557][SQL] Parquet 
and ORC predicate pushdown in nested fields
URL: https://github.com/apache/spark/pull/27155#issuecomment-574828934
 
 
   > Hello @emaynardigs ,
   > 
   > Thank you for your contribution, and I do value your work a lot. In fact, 
at Apple, we are still using an updated version of #22535 which is critical to 
our production job. As far as I know, Databricks's runtime also has an 
implementation with a similar approach to tackle this issue.
   > 
   > The reason I have been inactive on my previous PR is that I feel adding 
nested support to the current filter API is a short-term solution, since the 
design doesn't consider these complex use cases. For a better long-term 
solution, I would like to create a new set of FilterV2 APIs in the DSv2 framework 
that treats nested columns as first-class citizens. + @cloud-fan @rdblue @viirya 
for feedback on this.
   > 
   > I have already started to work on the FilterV2 API, and here is the WIP code: 
https://github.com/dbtsai/spark/pull/10/files . The change is bigger than I 
thought, and now I am debating whether we actually need a new FilterV2 framework.
   > 
   > Feedback and ideas are welcome.
   > 
   > Thanks.
   
   Hey @dbtsai, no worries. I suspected the silence was because you had 
moved this into a fork and were running with it :)
   
   I think the core approach you took here is sufficient for most 
cases, right? The crux of my change was only porting it to the new APIs and 
walking the schema itself to unpack nested columns instead of parsing the 
column name (I needed this for ORC anyway). After that it was pretty easy to add ORC 
support, as we use a fork of ORC internally while you guys use Parquet.
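   
   To illustrate the difference between walking the schema and parsing the 
column name, here is a minimal Python sketch. It is not the code from this PR 
or from Spark: the dict-based schema model and the `leaf_paths` helper are 
hypothetical, chosen only to show why a dotted name alone can be ambiguous.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical schema model (an assumption for illustration only):
# a struct is a dict of field name -> type, a leaf type is a plain string.
Schema = Dict[str, Any]
Path = Tuple[str, ...]


def leaf_paths(schema: Schema, prefix: Path = ()) -> Iterator[Tuple[Path, str]]:
    """Yield (path, type) for every leaf column by recursing into structs."""
    for name, dtype in schema.items():
        path = prefix + (name,)
        if isinstance(dtype, dict):
            # Nested struct: descend, carrying the path as separate segments.
            yield from leaf_paths(dtype, path)
        else:
            yield path, dtype


schema: Schema = {
    "id": "int",
    "a.b": "string",  # top-level column whose name contains a literal dot
    "a": {"b": "int", "c": {"d": "string"}},
}

paths = dict(leaf_paths(schema))

# Parsing the name splits the flat column "a.b" into the same segments as the
# nested field a -> b, so the two can no longer be told apart.
naive = tuple("a.b".split("."))

print(paths[("a.b",)])  # type of the flat column
print(paths[naive])     # type the name-splitting lookup resolves to instead
```

   Because each path is kept as a tuple of segments taken from the schema, the 
flat column `a.b` and the nested field `b` inside `a` stay distinct, whereas 
splitting the name on dots maps both to the same key.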
   
   What complex cases do you think break under this PR?
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
