ariel-miculas commented on issue #21446: URL: https://github.com/apache/datafusion/issues/21446#issuecomment-4278605042
Hey @diegoQuinas, feel free to pick this up.

> avoids needing ~100 GB of free disk for decompression.

It's around 230 GiB. I had some Python code to trim down the JSON, since it's quite a lot of data; maybe it's useful:

* https://gist.github.com/ariel-miculas/a7c7297e2d9e81c6e805ae9b5908b2c3
* https://gist.github.com/ariel-miculas/a2582690002c2fc1936b1db06e1efe48

> NdJsonReadOptions::default().file_compression_type(GZIP)

One issue is that gzipped files cannot be split, and for my use case I needed to exercise the code path that reads file ranges instead of entire files.

There's another adjustment that needs to be made: I didn't check the real file size after downloading hits.json.gz, and that size is used to decide whether the file needs to be redownloaded.
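The gists aren't reproduced here, but the trimming idea can be sketched roughly like this, assuming the input is gzipped NDJSON and we simply keep the first N records (the function name and parameters are illustrative, not the actual gist code):

```python
import gzip
import itertools

def trim_ndjson(src: str, dst: str, max_lines: int) -> int:
    """Copy the first `max_lines` newline-delimited JSON records from a
    gzipped source file into a plain-text destination file. Returns the
    number of records written."""
    written = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        # islice stops early, so the full 230 GiB file is never read.
        for line in itertools.islice(fin, max_lines):
            fout.write(line)
            written += 1
    return written
```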
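The size-check adjustment could look like this minimal sketch (a hypothetical helper, not the benchmark's actual download code): compare the on-disk size against an expected size before deciding to skip the download, so a truncated hits.json.gz triggers a re-download instead of being reused.

```python
import os

def needs_download(path: str, expected_size: int) -> bool:
    """Return True when the file is missing or its on-disk size does not
    match the expected size (e.g. taken from a Content-Length header),
    so partially downloaded files are fetched again."""
    return (not os.path.exists(path)
            or os.path.getsize(path) != expected_size)
```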
