ariel-miculas commented on issue #21446:
URL: https://github.com/apache/datafusion/issues/21446#issuecomment-4278605042

   Hey @diegoQuinas , feel free to pick this up.
   > avoids needing ~100 GB of free disk for decompression.
   
   it's around 230 GiB; I had some Python code to trim down the JSON, since
it's quite a lot of data, maybe it's useful:
   * https://gist.github.com/ariel-miculas/a7c7297e2d9e81c6e805ae9b5908b2c3
   * https://gist.github.com/ariel-miculas/a2582690002c2fc1936b1db06e1efe48
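
   The trimming idea can be sketched roughly like this (this is not the exact gist code; the file names and row count are placeholders):

```python
# Minimal sketch: keep only the first max_rows lines of an NDJSON file,
# copying raw bytes so each JSON record stays intact.
def trim_ndjson(src_path: str, dst_path: str, max_rows: int) -> int:
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            if written >= max_rows:
                break
            dst.write(line)
            written += 1
    return written
```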
   
   > NdJsonReadOptions::default().file_compression_type(GZIP)
   
   one issue is that gzipped files cannot be split; for my use case I
needed to exercise the code path that reads file ranges instead of entire files
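
   The difference can be illustrated with a small sketch (not DataFusion's actual reader, just why gzip defeats range reads; paths are placeholders):

```python
import gzip

# A plain file supports random access: any byte range can be read directly.
def read_range(path: str, start: int, length: int) -> bytes:
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)

# A gzip stream must be decompressed from its beginning; GzipFile.seek
# just decompresses and discards everything before the target offset,
# so assigning byte ranges of a .gz file to workers buys nothing.
def read_decompressed_range(path: str, start: int, length: int) -> bytes:
    with gzip.open(path, "rb") as f:
        f.seek(start)  # O(start) work: decompresses from the stream start
        return f.read(length)
```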
   
   There's another adjustment that needs to be made: I didn't check the real
file size after downloading hits.json.gz, and that size is used to avoid
re-downloading the file.
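
   The check could look something like this sketch (the expected size is a placeholder; the actual benchmark code may differ):

```python
import os

# Sketch: only skip the download when the local file exists AND has the
# expected size; otherwise a truncated hits.json.gz would get reused.
def needs_download(path: str, expected_size: int) -> bool:
    return not (os.path.exists(path) and os.path.getsize(path) == expected_size)
```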


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

