hkpeaks commented on issue #35688:
URL: https://github.com/apache/arrow/issues/35688#issuecomment-1559703150

   1. I will try these "C" functions to see whether they fit my purpose. I want 
to keep one file, databending.go, as proprietary software, so I need to build it 
as a binary similar to a C# DLL. Without this file, users of the open source 
project I have published have no way to build their own runtime for 
Windows/Linux. If the full Go project were open source, there would be no way to 
support this project financially.
   
   2. If the databending process does not involve math/statistics, I can use 
byte-to-byte conversion directly. If a particular column requires filtering by 
float type, its byte array is converted to float on demand, just to compare two 
float64 numbers; this only affects whether the current row is selected or not.
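   The on-demand conversion could be sketched as follows (the function name 
`selectRow` and the threshold filter are illustrative, not the actual 
databending.go code):

```go
package main

import (
	"fmt"
	"strconv"
)

// selectRow parses the filter column's bytes into a float64 only when a
// numeric comparison is required; all other columns stay as raw bytes.
func selectRow(field []byte, threshold float64) (bool, error) {
	v, err := strconv.ParseFloat(string(field), 64)
	if err != nil {
		return false, err
	}
	return v > threshold, nil // compare two float64 values: select or skip
}

func main() {
	row := [][]byte{[]byte("A123"), []byte("42.5")} // raw bytes from the CSV stream
	keep, _ := selectRow(row[1], 40.0)
	fmt.Println(keep)
}
```

   Only the filtered column pays the parsing cost; rows that fail the 
comparison are skipped without converting any other field.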
   
   3. I do not know whether your Parquet reader has built-in parallel 
streaming. If not, I will implement it for my project: read different row 
blocks of the selected columns in parallel, calculate how many blocks fit in 
memory, and if the data exceeds memory, run in batches similar to the current 
CSV streaming.
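   A memory-capped parallel block reader of this kind could be sketched with a 
buffered-channel semaphore (all names here are hypothetical, and integers stand 
in for real column blocks):

```go
package main

import (
	"fmt"
	"sync"
)

// processBlocks reads row blocks in parallel, but at most maxInFlight
// at a time, so the working set of loaded blocks fits in memory.
func processBlocks(blocks []int, maxInFlight int, read func(int) int) []int {
	results := make([]int, len(blocks))
	sem := make(chan struct{}, maxInFlight) // memory cap on in-flight blocks
	var wg sync.WaitGroup
	for i, b := range blocks {
		wg.Add(1)
		sem <- struct{}{} // blocks until a slot is free
		go func(i, b int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = read(b)
		}(i, b)
	}
	wg.Wait()
	return results
}

func main() {
	blocks := []int{0, 1, 2, 3, 4}
	out := processBlocks(blocks, 2, func(b int) int { return b * 10 })
	fmt.Println(out)
}
```

   Raising `maxInFlight` trades memory for parallelism, which matches the 
"run by batch when over memory" behaviour described above.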
   
   4. The byte array is exactly the byte stream read from the CSV file using 
ReadAt, and it is grouped into different partitions. For a 1-billion-row 
(67 GB) CSV, it is divided into 144 batches of streams by default, and each 
batch is further divided into 10 partitions that run in parallel, so the 
1-billion-row CSV is divided into 1,440 partitions in total. If you have time 
to try the pre-release, you will see the real performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
