[GitHub] [arrow] zeroshade commented on issue #35688: How to approach to implement Parquet-Go file format

via GitHub Tue, 23 May 2023 07:51:08 -0700


zeroshade commented on issue #35688:
URL: https://github.com/apache/arrow/issues/35688#issuecomment-1559591346


   @hkpeaks I'm not sure what you mean by failing to build a class library with 
Go. You can build a shared library that exports `extern "C"` functions by using 
cgo and the `-buildmode` option, for example `-buildmode=c-shared` (see 
[docs](https://pkg.go.dev/cmd/go#hdr-Build_modes)).
   
   > Based on numerous experiments, Golang readat has achieved the results of 
mmap. As a result, I am much more concerned with performance than with the name 
"mmap."
   
   That's fine, performance is what's important. Just please don't call it 
"mmap" if it's not actually using "mmap", as that confuses what you're trying to 
do. There are plenty of situations where mmap isn't necessary or might even hurt 
performance rather than improve it.
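
   To make the distinction concrete, here is a rough sketch of positioned reads 
through `io.ReaderAt` (which `*os.File` implements). This copies bytes into a 
buffer the program owns, which is different from mapping the file into the 
address space. The names `readChunks` and `readFromMemory` are hypothetical, and 
an in-memory reader stands in for a file so the example runs without touching 
disk.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// readChunks fills a buffer of `size` bytes by issuing one positioned read
// per chunk, each in its own goroutine. io.ReaderAt permits parallel calls
// on the same source.
func readChunks(r io.ReaderAt, size, chunk int64) ([]byte, error) {
	if chunk <= 0 {
		return nil, fmt.Errorf("chunk size must be positive, got %d", chunk)
	}
	out := make([]byte, size)
	n := int((size + chunk - 1) / chunk) // number of chunks, rounded up
	errs := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			off := int64(i) * chunk
			end := off + chunk
			if end > size {
				end = size
			}
			// A short read always comes with a non-nil error.
			if got, err := r.ReadAt(out[off:end], off); got < int(end-off) {
				errs[i] = err
			}
		}(i)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

// readFromMemory is a stand-in for opening a real file with os.Open.
func readFromMemory(data []byte, chunk int64) ([]byte, error) {
	return readChunks(bytes.NewReader(data), int64(len(data)), chunk)
}

func main() {
	got, err := readFromMemory([]byte("0123456789abcdef"), 5)
	fmt.Println(string(got), err)
}
```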
   
   > Bytearray reduces memory and CPU usage significantly. It avoids 
unnecessary dataset serialization and de-serialization.
   
   This is interesting to me and I'd like to see how that is the case. Wouldn't 
you need to serialize/de-serialize from bytes into something you can actually 
process like the various integral types, float data, etc?
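
   As a small sketch of the kind of decoding step I have in mind: raw bytes 
still have to be interpreted before arithmetic can be done on them, whether they 
hold a binary layout or text. The function names below are hypothetical, and the 
little-endian layout is an assumption made for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"strconv"
)

// decodeInt32s interprets raw as consecutive little-endian int32 values.
func decodeInt32s(raw []byte) ([]int32, error) {
	if len(raw)%4 != 0 {
		return nil, fmt.Errorf("length %d is not a multiple of 4", len(raw))
	}
	out := make([]int32, len(raw)/4)
	for i := range out {
		out[i] = int32(binary.LittleEndian.Uint32(raw[i*4:]))
	}
	return out, nil
}

// decodeFloat64s interprets raw as consecutive little-endian IEEE 754 doubles.
func decodeFloat64s(raw []byte) ([]float64, error) {
	if len(raw)%8 != 0 {
		return nil, fmt.Errorf("length %d is not a multiple of 8", len(raw))
	}
	out := make([]float64, len(raw)/8)
	for i := range out {
		out[i] = math.Float64frombits(binary.LittleEndian.Uint64(raw[i*8:]))
	}
	return out, nil
}

// parseTextInt handles the CSV case, where the bytes are decimal digits
// rather than a binary encoding.
func parseTextInt(field []byte) (int64, error) {
	return strconv.ParseInt(string(field), 10, 64)
}

func main() {
	ints, _ := decodeInt32s([]byte{1, 0, 0, 0, 0xff, 0xff, 0xff, 0xff})
	floats, _ := decodeFloat64s([]byte{0, 0, 0, 0, 0, 0, 0xf8, 0x3f})
	n, _ := parseTextInt([]byte("12345"))
	fmt.Println(ints, floats, n)
}
```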
   
   >  I'll think more about the benefits of using Arrow for CSV; I believe the 
main benefit is data exchange.
   
   It's not so much the benefits of using Arrow for CSV, but rather getting CSV 
data into Arrow format so that other processes/exchange/analytics can be run on 
it. The current CSV parsing/reading in the Go Arrow lib is very naive and 
doesn't do any parallelization, so it is ripe for improvement. I haven't had the 
time to do so myself but it would be fantastic to see contributions there from 
the community.
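
   As a very rough sketch of what parallelization could look like, using only 
the standard library rather than the Arrow CSV reader: split the input on 
newline boundaries and parse each piece in its own goroutine. This is not how 
the Arrow library works today, the names `splitAtNewlines` and `parseParallel` 
are hypothetical, and it assumes no quoted field contains an embedded newline.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"sync"
)

// splitAtNewlines cuts data into roughly n pieces, each ending just after a
// '\n'. Assumption: no quoted field contains an embedded newline.
func splitAtNewlines(data []byte, n int) [][]byte {
	if n < 1 {
		n = 1
	}
	target := len(data)/n + 1
	var parts [][]byte
	for len(data) > 0 {
		if target >= len(data) {
			parts = append(parts, data)
			break
		}
		nl := bytes.IndexByte(data[target:], '\n')
		if nl < 0 {
			parts = append(parts, data)
			break
		}
		end := target + nl + 1
		parts = append(parts, data[:end])
		data = data[end:]
	}
	return parts
}

// parseParallel parses each piece concurrently and returns rows in their
// original order.
func parseParallel(data []byte, workers int) ([][]string, error) {
	parts := splitAtNewlines(data, workers)
	results := make([][][]string, len(parts))
	errs := make([]error, len(parts))
	var wg sync.WaitGroup
	for i, p := range parts {
		wg.Add(1)
		go func(i int, p []byte) {
			defer wg.Done()
			results[i], errs[i] = csv.NewReader(bytes.NewReader(p)).ReadAll()
		}(i, p)
	}
	wg.Wait()
	var rows [][]string
	for i := range parts {
		if errs[i] != nil {
			return nil, errs[i]
		}
		rows = append(rows, results[i]...)
	}
	return rows, nil
}

func main() {
	rows, err := parseParallel([]byte("a,b\n1,2\n3,4\n5,6\n"), 2)
	fmt.Println(rows, err)
}
```

   A real implementation would also need to handle quoted newlines and build 
Arrow arrays from the parsed columns instead of returning strings.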
   
   > However, gRPC is also an excellent way to support very high data exchange 
performance over the internet.
   
   I wholeheartedly agree there, this is why [Arrow Flight 
RPC](https://arrow.apache.org/docs/format/Flight.html) uses gRPC.
   
   > I will consider whether it is possible to implement the Parquet format, 
which can outperform my current CSV format.
   
   What do you mean by implementing the Parquet format in this case? The Go 
Parquet library here already implements the Parquet format spec. Are you 
intending to re-implement the Parquet spec? Or just use the library provided 
here to perform the reads (and if you find inefficiencies, then contribute 
improvements back)?
   
   > And I hope Apache Foundation can consider bytearray is one of best data 
exchange format moving from one software to alternative software.
   
   By "bytearray" do you mean just a literal array of bytes? Or is there an 
actual data format called "bytearray"? I'm not quite sure what you're referring 
to here.

