hkpeaks commented on issue #35688: URL: https://github.com/apache/arrow/issues/35688#issuecomment-1558314840
Based on numerous experiments, Go's `ReadAt` has matched the performance of mmap, so I am much more concerned with performance than with the name "mmap" (a minimal sketch of the chunked `ReadAt` pattern I mean is at the end of this comment). I have built one of the fastest dataframes specialized for CSV files, so I recently investigated whether my project should also cover the Parquet format. Other formats, such as JSON and HTML, are simple for me; no research is required.

You can see some of my code, as I am working on reclassifying my Peaks project into open-source and proprietary parts. I previously tested Apache Arrow to see how it compared to slices, but I saw no performance improvement. However, I eventually removed the slice-based in-memory table structure from my project. For ETL functions such as "Distinct", "GroupBy", "Filter", and "JoinTable", using a bytearray as the in-memory table has yielded significant performance gains: it reduces memory and CPU usage considerably and avoids unnecessary serialization and deserialization of the dataset (see the second sketch below).

I have been retired for three years as a result of a COVID-19 layoff exercise. I have recently become very active in programming and marketing because I intend to return to work; I am now 55 years old. One of my goals is to contribute to the open-source community.

I will think more about the benefits of using Arrow for CSV; I believe the main benefit is data exchange. However, gRPC is also an excellent way to support very high data-exchange performance over the internet. My strength is innovative system design; when I design something in my head, coding becomes simple. My weakness, however, is reading code. I will try to replicate what I have learned from https://pkg.go.dev/github.com/apache/arrow/go/v12/arrow/csv#WithChunk (see the third sketch below).

I understand that users want a fast Apache Spark running on their desktop computer. Cloud computing is always expensive and risky when paid for with a credit card, so I am unable to use the cloud computing services provided by Databricks, Azure, Google, and AWS. Peaks Consolidation is designed to address this issue by allowing users to process billions of rows on a desktop computer. I will ask cloud companies to support prepayment by PayPal; the simplest model is that once the prepaid balance is used up, all VMs are removed automatically.

Thank you for your detailed reply to my question. I will consider whether it is possible to implement the Parquet format in a way that outperforms my current CSV format. I have noticed in other software that writing a Parquet file takes much more time than writing a CSV file. And I hope the Apache Foundation will consider bytearray as one of the best data-exchange formats for moving data from one piece of software to another.
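Here is a minimal sketch (not my actual Peaks code) of what I mean by chunked `ReadAt`: the file name `input.csv` and the 64 MiB chunk size are assumptions for illustration. `os.File.ReadAt` is documented as safe for parallel calls on the same file, which is what lets plain reads approach mmap-style throughput:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
)

func main() {
	const chunkSize = 64 << 20 // 64 MiB per chunk; tune for your workload

	f, err := os.Open("input.csv") // hypothetical file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		panic(err)
	}
	size := info.Size()

	var wg sync.WaitGroup
	for off := int64(0); off < size; off += chunkSize {
		off := off // capture loop variable for the goroutine
		n := int64(chunkSize)
		if off+n > size {
			n = size - off
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, n)
			// ReadAt is safe for concurrent use on the same *os.File;
			// the final chunk may return io.EOF with a full read.
			if _, err := f.ReadAt(buf, off); err != nil && err != io.EOF {
				panic(err)
			}
			// Process buf here. A real CSV reader must also handle
			// rows that straddle chunk boundaries.
			_ = buf
		}()
	}
	wg.Wait()
	fmt.Println("read", size, "bytes in chunks of", chunkSize)
}
```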
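And a toy illustration (not Peaks' actual layout) of why a bytearray-backed column avoids deserialization: values live in one contiguous byte buffer plus offsets, and a "Filter" compares raw bytes in place without ever materializing strings:

```go
package main

import (
	"bytes"
	"fmt"
)

// ByteColumn stores all values back to back in one byte buffer.
// offsets[i] marks the end of value i, so value i is the slice
// data[offsets[i-1]:offsets[i]] (starting at 0 for the first value).
type ByteColumn struct {
	data    []byte
	offsets []int
}

func (c *ByteColumn) Append(v []byte) {
	c.data = append(c.data, v...)
	c.offsets = append(c.offsets, len(c.data))
}

func (c *ByteColumn) Value(i int) []byte {
	start := 0
	if i > 0 {
		start = c.offsets[i-1]
	}
	return c.data[start:c.offsets[i]]
}

func (c *ByteColumn) Len() int { return len(c.offsets) }

// Filter returns the row indices whose value equals want,
// comparing raw bytes in place with no per-row allocation.
func (c *ByteColumn) Filter(want []byte) []int {
	var rows []int
	for i := 0; i < c.Len(); i++ {
		if bytes.Equal(c.Value(i), want) {
			rows = append(rows, i)
		}
	}
	return rows
}

func main() {
	var col ByteColumn
	for _, v := range []string{"red", "green", "red", "blue"} {
		col.Append([]byte(v))
	}
	fmt.Println(col.Filter([]byte("red"))) // [0 2]
}
```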
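Finally, a minimal sketch of what the linked `WithChunk` option does: the Arrow Go CSV reader yields record batches of a fixed row count. The schema and file name here are assumptions for illustration; adjust them to the actual file:

```go
package main

import (
	"fmt"
	"os"

	"github.com/apache/arrow/go/v12/arrow"
	"github.com/apache/arrow/go/v12/arrow/csv"
)

func main() {
	f, err := os.Open("input.csv") // hypothetical file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Example schema; replace with the columns of your file.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "name", Type: arrow.BinaryTypes.String},
	}, nil)

	r := csv.NewReader(f, schema,
		csv.WithChunk(100_000), // rows per record batch
		csv.WithHeader(true),
	)
	defer r.Release()

	for r.Next() {
		rec := r.Record()
		fmt.Println("rows in this batch:", rec.NumRows())
		// The record is only valid until the next call to Next.
	}
	if err := r.Err(); err != nil {
		panic(err)
	}
}
```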
