Yuan-Ru-Lin commented on issue #434:
URL: https://github.com/apache/arrow-julia/issues/434#issuecomment-3093807424
> Is there a way to get the batch-offset table with Arrow.jl, if the data is
> written in "file" mode?
Yes.
Consider `test.arrow` generated by the following script.
```julia
using Arrow
using TypedTables
using Tables
t = Table(
a=collect(1:10_000),
b=rand(Float32, 10_000),
c=rand(ComplexF32, 10_000),
)
# This would produce 10 RecordBatches
Arrow.write("test.arrow", Tables.partitioner(Iterators.partition(t, 1_000)))
```
Then one can get the offsets of all the `RecordBatch`es by `read`ing the
footer bytes at the end of the file and parsing them with
`Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, _footerbytes, 0)`:
```julia
using Arrow
f = open("test.arrow")
# Check whether the magic number is there
seekend(f)
seek(f, position(f) - 6)
@assert String(read(f, 6)) == "ARROW1"
# Fetch footer size
seekend(f)
seek(f, position(f) - 6 - 4)
footersize = only(reinterpret(Int32, read(f, 4)))
@assert footersize == 560  # value for this particular file
# Fetch footer
seekend(f)
seek(f, position(f) - 6 - 4 - footersize)
_footerbytes = read(f, footersize)
_footer = Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, _footerbytes, 0)
#= REPL output (a block comment, so the script still runs as-is):
julia> _footer.recordBatches
10-element Arrow.FlatBuffers.Array{Arrow.Flatbuf.Block, NTuple{24, UInt8}, Arrow.Flatbuf.Footer}:
 Arrow.Flatbuf.Block(offset = 320, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 20640, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 40960, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 61280, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 81600, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 101920, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 122240, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 142560, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 162880, metaDataLength = 320, bodyLength = 20000)
 Arrow.Flatbuf.Block(offset = 183200, metaDataLength = 320, bodyLength = 20000)
=#
# Sanity check: fetch the first column in the first block using the above information.
# The body starts at offset + metaDataLength; column `a` is 1000 Int64s = 8000 bytes.
seek(f, 320 + 320)
block1data = read(f, 20000)
reinterpret(Int64, block1data[1:8000])
#= REPL output:
julia> reinterpret(Int64, block1data[1:8000])
1000-element reinterpret(Int64, ::Vector{UInt8}):
 1
 2
 3
 4
 (omitted)
=#
close(f)
```
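The trailer walk above can be wrapped in a small helper that works for any seekable stream. This is only a minimal sketch: `read_footer_bytes` is a hypothetical name, not part of Arrow.jl, and it assumes the layout used above (footer, then a 4-byte little-endian footer size, then the 6-byte `ARROW1` magic).

```julia
# Hypothetical helper (not part of Arrow.jl): return the raw footer bytes of a
# stream laid out as  ... | footer | Int32 footer size | "ARROW1".
function read_footer_bytes(io::IO)
    magiclen = 6                           # length of "ARROW1"
    trailer = magiclen + sizeof(Int32)     # magic + footer-size field
    seekend(io)
    total = position(io)
    total >= trailer || error("stream is too short to be an Arrow file")

    # Check the trailing magic number
    seek(io, total - magiclen)
    String(read(io, magiclen)) == "ARROW1" || error("missing trailing ARROW1 magic")

    # Footer size: little-endian Int32 stored just before the magic
    seek(io, total - trailer)
    footersize = Int(ltoh(read(io, Int32)))
    0 < footersize <= total - trailer || error("implausible footer size $footersize")

    # The footer itself sits immediately before its size field
    seek(io, total - trailer - footersize)
    return read(io, footersize)
end

# Intended use with the file from this comment (assumes Arrow is loaded):
#   footerbytes = open(read_footer_bytes, "test.arrow")
#   footer = Arrow.FlatBuffers.getrootas(Arrow.Meta.Footer, footerbytes, 0)
```

Reading the size from the trailer avoids hard-coding a value such as 560, which only holds for this specific file.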
I accessed the first batch here, but in principle one can access any
block without reading the others.
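As a rough sketch of that random access, the seek arithmetic is just `offset + metaDataLength` for the start of a block's body. `read_block_body` is a hypothetical helper, not an Arrow.jl function; it accepts anything exposing the `offset`, `metaDataLength` and `bodyLength` fields shown in the footer listing. The demo uses an in-memory stand-in rather than a real Arrow file.

```julia
# Hypothetical helper (not part of Arrow.jl): read only the body of one block.
function read_block_body(io::IO, block)
    # The body starts right after the block's metadata.
    seek(io, Int(block.offset) + Int(block.metaDataLength))
    return read(io, Int(block.bodyLength))
end

# In-memory stand-in mimicking the first block of the file above.
io = IOBuffer()
write(io, zeros(UInt8, 640))         # placeholder for file header + block metadata
write(io, collect(Int64, 1:1000))    # 8000 bytes playing the role of column `a`
write(io, zeros(UInt8, 12000))       # placeholder for columns `b` and `c`

block = (offset = 320, metaDataLength = 320, bodyLength = 20000)
body = read_block_body(io, block)
first_column = reinterpret(Int64, body[1:8000])

# With the real file, the same call would presumably look like
#   read_block_body(f, _footer.recordBatches[3])
```

Splitting the body into columns still relies on knowing each buffer's length, which here comes from the schema (1000 `Int64` values occupy 8000 bytes); in general that information lives in the `RecordBatch` metadata.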
In order to come up with an API, I still need to know how to parse bytes
that make up a `RecordBatch`.
By the way, this might provide a way to close
https://github.com/apache/arrow-julia/issues/353