jychen7 commented on issue #5404:
URL: 
https://github.com/apache/arrow-datafusion/issues/5404#issuecomment-1445370933

   > Lazy projection (also known as late projection) would help most in this 
case: we fetch only the URL column in the first pass, apply the ORDER BY and 
LIMIT, and then project the other columns by row ID.
   
   Wow, you are right: selecting only the `URL` column narrows the gap, with 
datafusion about 1.75x slower than duckdb (5.626 s vs. 3.207 s).
   
   ```
   # datafusion v19.rc1
   > SELECT "URL" FROM hits WHERE "URL" LIKE '%google%';
   // 15911 rows in set. Query took 5.626 seconds
   
   # duckdb v0.6.1
   > SELECT URL FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%';
   // 15911 rows. Run Time (s): real 3.207 user 34.520757 sys 1.555060
   ```
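
To make the quoted "lazy projection" idea concrete, here is a minimal sketch in plain Python. It is not how datafusion or duckdb implement it; the toy table, the `late_projection` helper, and the `EventTime`/`Title` columns are assumptions made up for illustration. The point is only the order of work: scan the filter column, apply order and limit on row IDs, and fetch the remaining columns last.

```python
# Toy column store: each column is a list, and the list index acts as the row ID.
# The data and the column names other than "URL" are illustrative assumptions.
table = {
    "URL": [
        "https://a.example/google",
        "https://b.example/",
        "https://google.example/x",
        "https://c.example/google?q=1",
    ],
    "EventTime": [30, 10, 20, 5],
    "Title": ["t0", "t1", "t2", "t3"],
}


def late_projection(table, filter_col, needle, order_col, limit):
    """Hypothetical helper: filter, order, and limit on row IDs, then project."""
    # Pass 1: read only the filter column and keep the matching row IDs.
    row_ids = [i for i, value in enumerate(table[filter_col]) if needle in value]
    # Pass 2: read the ordering column only for the survivors, then apply the limit.
    row_ids.sort(key=lambda i: table[order_col][i])
    row_ids = row_ids[:limit]
    # Pass 3: materialize every column, but only for the final row IDs.
    return [{name: column[i] for name, column in table.items()} for i in row_ids]


result = late_projection(table, "URL", "google", "EventTime", 2)
for row in result:
    print(row)
```

With wide rows and a selective filter, pass 3 touches only `limit` rows of the other columns instead of every matching row, which is where the saving would come from.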
   
   > URL is a large binary column in the hits dataset, and duckdb optimizes 
reading parquet into its memory model (it reuses the original buffer). You can 
verify that by running `select max(URL) from table`.
   
   Thanks for the info. Do you happen to have a link to the code or a blog 
post about "reused the original buffer"?
   I ran a test and found that datafusion and duckdb perform basically the 
same on `SELECT max("URL") FROM hits`:
   
   ```
   # datafusion v19.rc1
   SELECT max("URL") FROM hits;
   +-----------------------------------------+
   | MAX(hits.URL)                           |
   +-----------------------------------------+
   | https://yugra-advert2792270][to]=&input |
   +-----------------------------------------+
   1 row in set. Query took 2.726 seconds.
   
   # duckdb v0.6.1
   ┌─────────────────────────────────────────┐
   │               max("URL")                │
   │                 varchar                 │
   ├─────────────────────────────────────────┤
   │ https://yugra-advert2792270][to]=&input │
   └─────────────────────────────────────────┘
   Run Time (s): real 2.746 user 28.837000 sys 2.205152
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
