hkpeaks commented on issue #35688:
URL: https://github.com/apache/arrow/issues/35688#issuecomment-1558314840

   Based on numerous experiments, Go's ReadAt has achieved results comparable 
to mmap. As a result, I am much more concerned with performance than with the 
name "mmap." I created one of the fastest dataframes specialized in CSV files, 
so I recently investigated whether my project should cover the Parquet format. 
Other formats, such as JSON and HTML, are simple for me; no research is 
required.
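
   To illustrate what I mean by ReadAt-based chunked reading, here is a 
minimal sketch; the file name and chunk size are my own placeholders, not 
code from Peaks:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	// Hypothetical input file; ReadAt lets independent goroutines read
	// disjoint byte ranges of the same *os.File without seeking.
	f, err := os.Open("data.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		panic(err)
	}

	const chunkSize = 1 << 20 // 1 MiB per chunk (arbitrary choice)
	buf := make([]byte, chunkSize)
	for off := int64(0); off < info.Size(); off += chunkSize {
		n, err := f.ReadAt(buf, off)
		fmt.Printf("read %d bytes at offset %d\n", n, off)
		if err == io.EOF {
			break // final, possibly partial, chunk
		}
		if err != nil {
			panic(err)
		}
	}
}
```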
   
   You can see some of my code because I'm working on reclassifying my Peaks 
project into open-source and proprietary parts. I previously tested Apache 
Arrow to see how it compared to slices, but I saw no improvement in 
performance. However, I eventually removed the slice-based in-memory table 
structure from my project. When supporting ETL functions such as "Distinct", 
"GroupBy", "Filter", and "JoinTable", using a bytearray as the in-memory table 
has been shown to yield significant performance gains: it reduces memory and 
CPU usage significantly and avoids unnecessary dataset serialization and 
deserialization. A rough sketch of the idea follows below.
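
   The sketch below is only my own illustration of the bytearray-table idea, 
not the actual Peaks implementation: the raw bytes stay in a single []byte 
and rows are addressed by integer offsets, so scans and filters can work on 
the bytes directly without re-serializing the dataset.

```go
package main

import "fmt"

// byteTable keeps the raw CSV bytes in one []byte and stores only the
// start offset of each row, so no per-cell allocation or copying is
// needed when iterating, filtering, or grouping.
type byteTable struct {
	data    []byte // raw file contents
	rowOffs []int  // start offset of each row in data
}

func newByteTable(data []byte) *byteTable {
	t := &byteTable{data: data, rowOffs: []int{0}}
	for i, b := range data {
		if b == '\n' && i+1 < len(data) {
			t.rowOffs = append(t.rowOffs, i+1)
		}
	}
	return t
}

// row returns the bytes of row i without copying.
func (t *byteTable) row(i int) []byte {
	start := t.rowOffs[i]
	end := len(t.data)
	if i+1 < len(t.rowOffs) {
		end = t.rowOffs[i+1] - 1 // strip the '\n' separator
	} else if end > start && t.data[end-1] == '\n' {
		end-- // strip a trailing newline on the last row
	}
	return t.data[start:end]
}

func main() {
	t := newByteTable([]byte("a,1\nb,2\nc,3\n"))
	for i := range t.rowOffs {
		fmt.Printf("row %d: %s\n", i, t.row(i))
	}
}
```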
   
   I've been retired for three years as a result of the COVID-19 layoffs. I've 
recently become very active in programming and marketing because I intend to 
return to work now that I'm 55 years old. One of my goals is to contribute to 
the open-source community. I'll think more about the benefits of using Arrow 
for CSV; I believe the main benefit is data exchange. However, gRPC is also an 
excellent way to support very high data-exchange performance over the 
internet.
   
   My strength is innovative system design; when I design something in my 
head, coding becomes simple. My weakness, however, is reading code. I'll try 
to replicate what I've learned from 
https://pkg.go.dev/github.com/apache/arrow/go/v12/arrow/csv#WithChunk; a 
minimal sketch of that API is shown below.
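
   For reference, this is roughly how the documented WithChunk option is used; 
the schema, file name, and chunk size here are my own placeholders rather 
than anything from Peaks:

```go
package main

import (
	"fmt"
	"os"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/arrow/csv"
)

func main() {
	// Hypothetical CSV file and schema; adjust to the real data.
	f, err := os.Open("input.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "value", Type: arrow.PrimitiveTypes.Float64},
	}, nil)

	// WithChunk sets how many rows are grouped into each record batch.
	rdr := csv.NewReader(f, schema,
		csv.WithChunk(100000),
		csv.WithHeader(true),
	)
	defer rdr.Release()

	for rdr.Next() {
		rec := rdr.Record()
		fmt.Println("rows in this batch:", rec.NumRows())
		// rec is only valid until the next call to Next();
		// call rec.Retain() if it must outlive this iteration.
	}
	if err := rdr.Err(); err != nil {
		panic(err)
	}
}
```
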
   I understand that users want a fast Apache Spark running on their desktop 
computer. Cloud computing is always expensive and risky when paid with a 
credit card. As a result, I am unable to use the cloud computing services 
provided by Databricks, Azure, Google, and AWS. Peaks Consolidation is 
designed to address this issue by allowing users to process billions of rows 
on a desktop computer. I will ask cloud companies to support prepayment by 
PayPal; a simple model would be that once the prepaid balance is used up, all 
VMs are removed automatically.
   
   Thank you for your detailed reply to my question. I will consider whether 
it is possible to implement the Parquet format in a way that outperforms my 
current CSV handling. I have noticed, in other software, that writing a 
Parquet file takes much more time than writing a CSV file. I also hope the 
Apache Foundation will consider bytearray as one of the best data-exchange 
formats for moving data from one piece of software to another.
   
   
   

