[GitHub] [arrow-julia] kazuakiyama opened a new issue, #410: Official support for the Apache Parquet format

via GitHub Mon, 27 Mar 2023 14:13:02 -0700


kazuakiyama opened a new issue, #410:
URL: https://github.com/apache/arrow-julia/issues/410


   I'm a radio astronomer interested in using this Julia-native implementation 
of the Apache Arrow in-memory format for black hole imaging with the [Event 
Horizon Telescope](https://eventhorizontelescope.org/). First of all, thanks 
for developing this package! We became interested in it because the Apache 
Arrow and Parquet formats are being considered as a [major candidate for 
the next-generation radio astronomy data 
format](https://github.com/ratt-ru/casa-arrow/discussions/1). 
   
   I'm wondering whether the package plans to implement I/O functions for the 
Apache Parquet format in the future. I read a previous 
[issue](https://github.com/apache/arrow-julia/issues/227) on this topic. I 
believe no method is yet available to load columnar data from a Parquet file 
directly into Arrow.jl's in-memory representation, or to write it back. The 
only pure-Julia way to handle this seems to be to convert the on-disk data to 
the Arrow IPC format using both Parquet.jl and Arrow.jl, and then reload it 
into memory with Arrow.jl.
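   
   For reference, a minimal sketch of that round trip, assuming the 
`read_parquet`/`write_parquet` functions of Parquet.jl and the 
`Arrow.write`/`Arrow.Table` functions of Arrow.jl. The file names and the 
two flat columns are hypothetical and only there to make the example 
self-contained; nested columns are not covered.

```julia
using Parquet, Arrow

# Hypothetical file locations, used only for illustration.
dir = mktempdir()
parquet_path = joinpath(dir, "example.parquet")
arrow_path = joinpath(dir, "example.arrow")

# Create a small flat Parquet file so the sketch can run on its own.
tbl = (time = [0.0, 1.0, 2.0], antenna = Int64[1, 2, 3])
Parquet.write_parquet(parquet_path, tbl)

# Step 1: read the Parquet file from disk with Parquet.jl.
pq = Parquet.read_parquet(parquet_path)

# Step 2: write the same table back to disk in the Arrow IPC format.
Arrow.write(arrow_path, pq)

# Step 3: reload the IPC file into Arrow.jl's in-memory representation.
at = Arrow.Table(arrow_path)
```

   Steps 2 and 3 are the extra disk write and read that direct Parquet I/O in 
Arrow.jl would avoid.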
   
   This is problematic for our use case, and it is a major issue preventing 
us from using this package and Apache's columnar formats in Julia. I think the 
key issues here are:
   - This sort of disk-based conversion via 
[Parquet.jl](https://github.com/JuliaIO/Parquet.jl) and Arrow.jl is not 
computationally optimal, as it involves an extra disk write and read. This 
will be a major overhead in our use case.
   - The Arrow IPC format is [not prioritizing long-term storage and archival 
usage](https://arrow.apache.org/faq/#what-about-arrow-files-then), which would 
not satisfy the requirements of our community. So relying purely on the IPC 
format won't be a solution.
   - The current Julia packages for the Apache Parquet format (e.g. 
[Parquet.jl](https://github.com/JuliaIO/Parquet.jl) and 
[Parquet2.jl](https://gitlab.com/ExpandingMan/Parquet2.jl)) do not seem to 
fully support nested types, which are key to handling [our radio astronomy 
data in Apache's columnar formats](https://github.com/ratt-ru/casa-arrow), 
while Arrow.jl does support them for the Arrow in-memory and IPC formats.
   
   Given the many similarities and overlaps between the specifications of the 
Apache Parquet and Arrow formats, I feel it is more straightforward to request 
Parquet I/O features in Arrow.jl than to request the missing features from the 
existing Julia Parquet packages. Any thoughts on this are appreciated. Thanks!


