Re: [I] `select count(*) from '*.parquet'` scans all parquet file recursively including subdirectory [arrow-datafusion]

via GitHub Thu, 14 Dec 2023 20:17:03 -0800


zhangxffff commented on issue #8524:
URL: 
https://github.com/apache/arrow-datafusion/issues/8524#issuecomment-1857250251


   > I think the reason this happens is that the `ListingTable` does 
`ObjectStore::list` which finds files in all subdirectories
   > 
   > > I wonder is this behavior by design or a bug.
   > 
   > As I understand it, DataFusion is trying to model the behavior of "Hive 
PartitionedTables" -- so to answer this question I think we need to research 
what Hive does in this case
   
   I tried a Hive external table stored as Parquet. It seems that a Hive 
external table also does not scan Parquet files in subdirectories.
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/91bbd812-1d16-4268-9d3b-2a8dcde821c8)
   
   As shown in this picture, when the location is 
`hdfs:///user/hive/warehouse/zxf_test/`, there is no data in the external table. 
When the location is `hdfs:///user/hive/warehouse/zxf_test/subdir`, the external 
table has two records from two Parquet files.
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/d34fc48f-3c73-4bb4-bed6-39c6a154cf98)
   
   I also tried a partitioned external table.
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/336b4247-8031-40f2-bd3f-d4f18b44c1fd)
   
   After creating the table, there is no data.
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/0fb69b22-ba44-43af-a073-f313dd4d51f4)
   
   After specifying the location of partition `pt1`, we can get data from 
`hdfs:///user/hive/warehouse/zxf_test_pt/pt1`
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/10eff031-51b9-435f-a587-b9bb37de2d37)
   
   After also specifying the location of partition `pt2`, we can get data from both 
`hdfs:///user/hive/warehouse/zxf_test_pt/pt1` and 
`hdfs:///user/hive/warehouse/zxf_test_pt/pt2`
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/c7de3f7e-0a95-4358-b2d7-2469b6db1651)
   
   If I copy a subdirectory containing a Parquet file into a Hive partition 
directory, Hive reports a `java.io.IOException:java.io.IOException: Not a file`
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/43817c4e-99e1-48e0-88e6-f327ba2d5ddf)
   
![image](https://github.com/apache/arrow-datafusion/assets/3616081/e05fa691-178a-45ae-8123-6d9f94e98e59)
   
   So it seems that Hive also does not scan Parquet files in subdirectories. For 
a Hive partitioned table, the user should specify the directory of each partition, 
and that directory should not contain any subdirectories.
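
   To make the difference concrete, here is a minimal sketch in plain Python 
(standard library only, not DataFusion or Hive code). It contrasts a recursive 
listing, which is roughly what a prefix-style `ObjectStore::list` yields, with a 
direct-children-only listing, which matches what I observed from Hive. The 
function names and the file names `b.parquet` / `c.parquet` are hypothetical; 
the directory layout mirrors the `zxf_test/subdir` case above.

```python
import tempfile
from pathlib import Path


def list_recursive(root: Path, suffix: str = ".parquet") -> list[str]:
    """Return every matching file under root, at any depth."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(f"*{suffix}")
        if p.is_file()
    )


def list_direct(root: Path, suffix: str = ".parquet") -> list[str]:
    """Return only matching files that sit directly inside root."""
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


def build_demo_tree(root: Path) -> None:
    """Create a table directory whose only data files live in a subdirectory."""
    (root / "subdir").mkdir()
    for name in ("b.parquet", "c.parquet"):  # hypothetical file names
        (root / "subdir" / name).touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table_dir = Path(tmp)
        build_demo_tree(table_dir)
        print("recursive:", list_recursive(table_dir))
        print("direct:   ", list_direct(table_dir))
        print("subdir:   ", list_direct(table_dir / "subdir"))
```

   With this layout the recursive listing finds both files from the table root, 
while the direct listing finds nothing until it is pointed at `subdir` itself.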
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
