hkpeaks commented on issue #35688: URL: https://github.com/apache/arrow/issues/35688#issuecomment-1559703150
1. I will try these "C" functions to see whether they fit my purpose. I want to keep one file, databending.go, as proprietary software, so I need to build it as a binary similar to a C# DLL. Without this file, users of the open-source project I have published have no way to build their own runtime for Windows/Linux. If the full Go project were open source, there would be no way to support this project financially.

2. If the databending process does not involve math/statistics, I can use byte-to-byte conversion directly. If a particular column requires filtering by float type, the byte array is converted to float on demand, comparing just two float64 numbers; this only affects whether the current row is selected or not.

3. I do not know whether your Parquet reader has built-in parallel streaming. If not, I will implement it for my project: I will read different row blocks of the selected columns in parallel. I will calculate how many blocks fit in memory; if the data exceeds memory, it will run in batches, similar to the current CSV streaming.

4. The byte array is exactly the byte stream read from the CSV file using ReadAt, and it is grouped into partitions. For a 1-billion-row (67 GB) CSV, it is divided into 144 batches of streams by default, and each batch is further divided into 10 partitions that run in parallel, so the 1-billion-row CSV is divided into 1,440 partitions. If you have time to try the pre-release, you will see the real performance.
