westonpace commented on PR #35440:
URL: https://github.com/apache/arrow/pull/35440#issuecomment-1604505953

   I've gone through and done some more thorough testing now.  I was surprised 
to see that we get a big improvement even when running in EC2, even if we're in 
the same region as the bucket (where the latency should be very low).  I tested 
with two datasets.
   
   The first test dataset (s3://ursa-qa/wide-partition) was 10,000 files split 
into 10,000 folders nested two deep:
   
   ```
   x=0/y=0/file.parquet
   x=0/y=1/file.parquet
   ...
   x=0/y=99/file.parquet
   x=1/y=0/file.parquet
   ...
   x=99/y=99/file.parquet
   ```
   
   The second dataset (s3://ursa-qa/flat-partition) was 10,000 files in the 
root folder:
   
   ```
   file-0.parquet
   file-1.parquet
   ...
   file-9999.parquet
   ```
   
   For all of my tests I timed how long it took to recursively list all files 
in the dataset.  I ran the tests on my local desktop (outside of EC2), on an 
EC2 server in a different region (us-west-2), and on an EC2 server in the same 
region as the bucket (us-east-2).  All times are in seconds.
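   
   Roughly, the measurement looks like the following sketch using pyarrow's 
filesystem API (the region, anonymous access, and exact timing harness here are 
illustrative assumptions, not part of this PR):
   
   ```python
   import time
   
   import pyarrow.fs as pafs
   
   # Assumed region and anonymous access for the ursa-qa bucket; adjust as needed.
   fs = pafs.S3FileSystem(region="us-east-2", anonymous=True)
   
   # Recursively discover every file under the dataset prefix.
   selector = pafs.FileSelector("ursa-qa/wide-partition", recursive=True)
   
   start = time.perf_counter()
   infos = fs.get_file_info(selector)
   elapsed = time.perf_counter() - start
   
   print(f"listed {len(infos)} entries in {elapsed:.2f}s")
   ```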
   
   There were (as hoped) significant performance improvements in the 
wide-partition dataset:
   
   ![Screenshot 2023-06-23 at 08-43-34 Online Graph Maker · Plotly Chart Studio](https://github.com/apache/arrow/assets/1696093/6819bdd9-30f8-48ec-b44d-792b335ee2ad)
   
   Regrettably, there might be a slight regression in the flat-partition 
dataset, although it is largely within the noise.  I have run the tests enough 
times that I believe the result is stable:
   
   ![Screenshot 2023-06-23 at 08-44-36 Online Graph Maker · Plotly Chart Studio](https://github.com/apache/arrow/assets/1696093/53fcd6e0-1f3f-47ef-8f4e-f264ecd3770a)
   
   I've verified that, with both the old and the new approach, we send the same 
number of HTTP messages to S3 and their content is very close in size (less 
than 300 bytes of difference).  I don't think the extra time is additional 
compute, or else I'd expect to see a worse regression on the AWS servers.
   
   We could keep both styles (tree walking and recursive listing) but I don't 
think this regression is significant enough to justify the complexity.
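   
   For anyone not familiar with the two styles, here is a rough sketch of the 
difference using pyarrow's filesystem API (illustrative only; the real logic 
lives in the C++ S3 filesystem, and these helper names are hypothetical):
   
   ```python
   import pyarrow.fs as pafs
   
   fs = pafs.S3FileSystem(region="us-east-2", anonymous=True)  # assumed setup
   
   def list_tree_walking(base_dir):
       # Old style: list one directory at a time and descend into each
       # sub-directory, so every level of nesting costs another round of
       # ListObjects calls (slow for the wide-partition layout).
       results = []
       stack = [base_dir]
       while stack:
           for info in fs.get_file_info(pafs.FileSelector(stack.pop(), recursive=False)):
               results.append(info)
               if info.type == pafs.FileType.Directory:
                   stack.append(info.path)
       return results
   
   def list_recursive(base_dir):
       # New style: a single recursive listing, which maps to flat
       # ListObjectsV2 pagination under the hood.
       return fs.get_file_info(pafs.FileSelector(base_dir, recursive=True))
   ```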
   
   There is one other case that would likely regress.  That is the case where 
the data is deeply partitioned (e.g. each file is 4 or 5 folders deep) and the 
user specifies a low max recursion.  For example...
   
   ```
   a=0/b=0/c=0/d=0/e=0/file-0.parquet
   ...
   a=0/b=0/c=0/d=0/e=0/file-9999.parquet
   ```
   
   I would expect no regression if I fully listed the above dataset.  However, 
if I listed it with a max_recursion of 2 then the old approach would likely be 
much faster, since it only needs to return 1 file info (the new approach would 
return all 10k file infos and then we would pare them down in memory).  I'm 
not aware of anyone relying on this use case (pyarrow doesn't even expose 
max_recursion), so I'm not sure it is worth the complexity of keeping both 
approaches.  Even in this case I suspect we would be looking at a difference 
of 0.3 vs 3 seconds, which is better than our current worst case (~100 
seconds).
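   
   If we do drop the old approach entirely, the in-memory paring for a low 
max_recursion would look roughly like this (the depth arithmetic below is an 
illustrative assumption, not the exact max_recursion semantics):
   
   ```python
   def pare_to_max_recursion(infos, base_dir, max_recursion):
       # Keep only entries at most `max_recursion` levels below `base_dir`,
       # mimicking what the new approach would do after fetching the full
       # recursive listing from S3.
       base_depth = base_dir.rstrip("/").count("/")
       return [
           info for info in infos
           if info.path.rstrip("/").count("/") - base_depth <= max_recursion
       ]
   ```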
   

