sitano opened a new issue, #2930:
URL: https://github.com/apache/arrow-datafusion/issues/2930

   **Describe the bug**
   
   If you run the following query against a 10 GB file on S3 remote storage:
   
   > SELECT * FROM test LIMIT 1;
   
   it will read the WHOLE file (10 GB) instead of just the first row (chunk).
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   1. Put a 1 GB CSV file on S3
   2. Register the s3 contrib object store (the store itself works fine):
   ```
   //  let mut ctx: Context = Context::new_local(&session_config);
       let mut ctx = {
           let runtime = RuntimeEnv::new(RuntimeConfig::default()).unwrap();
        runtime.register_object_store("s3", Arc::new(S3FileSystem::default().await));
           Context::Local(SessionContext::with_config_rt(
               session_config.clone(),
               Arc::new(runtime.clone()),
           ))
       };
   ```
   3. CREATE EXTERNAL TABLE test (...) STORED AS CSV WITH HEADER ROW LOCATION 's3://blah/blah.csv';
   4. SELECT * FROM test LIMIT 1;
   
   ```
   list file from: s3://blah/blah.csv
   sync_chunk_reader: 0-10428263736
   sending get object request blah/blah.csv
   ArrowError(ExternalError(Custom { kind: TimedOut, error: AWS("Timeout") }))
   ```
   
   **Expected behavior**
   
   It should read only a small chunk, enough to execute the LIMIT 1 query.
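   
   To illustrate the expected behaviour, here is a minimal sketch in plain Rust (std only, not DataFusion's actual reader API): the read is capped at a small chunk, which is the effect a ranged GET against S3 would have, and decoding stops as soon as the LIMIT is satisfied. The 1 MiB chunk size and the helper name are illustrative assumptions, not anything from the engine.
   
    ```
    use std::io::{BufRead, BufReader, Read};
    
    /// Hypothetical helper: bound the read to `chunk_size` bytes (what a ranged
    /// GET such as `bytes=0-1048575` would do against S3) and return the first
    /// data row, which is all a `LIMIT 1` needs.
    fn read_limit_1<R: Read>(reader: R, chunk_size: u64) -> std::io::Result<Option<String>> {
        let mut lines = BufReader::new(reader.take(chunk_size)).lines();
        let _header = lines.next().transpose()?; // skip the CSV header row
        lines.next().transpose()                 // first data row
    }
    
    fn main() -> std::io::Result<()> {
        let csv = "a,b\n1,2\n3,4\n";
        // 1 MiB cap stands in for a ranged object-store request,
        // instead of the 0-10428263736 range seen in the log above.
        let row = read_limit_1(csv.as_bytes(), 1024 * 1024)?;
        println!("{:?}", row); // Some("1,2")
        Ok(())
    }
    ```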
   
   **Additional context**
   
   The contrib module is fine... it's the engine that requests this epic length.
   

