It would be nice to have an API to look at the file footer (we don't
have one in C++ either), so I opened

https://issues.apache.org/jira/browse/ARROW-3283
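
As a rough illustration of why a footer API would be cheap on a seekable
input: as far as I understand the documented file layout, an Arrow file
ends with the footer, then a 4-byte little-endian footer length, then the
magic string "ARROW1". The sketch below uses only the JDK and reads just
that tail. The class and method names (ArrowFooterPeek, readFooterLength)
are made up for this example and are not part of the Arrow Java API.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Arrow: locates the footer of an Arrow
// file by reading only the last 10 bytes of a seekable channel.
final class ArrowFooterPeek {
    private static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    /** Returns the footer length in bytes without reading any record batch body. */
    static int readFooterLength(SeekableByteChannel channel) throws IOException {
        // Assumed tail layout: <footer><int32 footer length, little-endian><"ARROW1">
        int tailSize = Integer.BYTES + MAGIC.length;
        long size = channel.size();
        if (size < MAGIC.length + tailSize) {
            throw new IOException("Too short to be an Arrow file: " + size + " bytes");
        }

        ByteBuffer tail = ByteBuffer.allocate(tailSize).order(ByteOrder.LITTLE_ENDIAN);
        channel.position(size - tailSize);
        while (tail.hasRemaining()) {
            if (channel.read(tail) < 0) {
                throw new IOException("Unexpected end of file while reading the tail");
            }
        }
        tail.flip();

        int footerLength = tail.getInt();
        byte[] magic = new byte[MAGIC.length];
        tail.get(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Missing ARROW1 magic at end of file");
        }
        // The footer must fit between the leading magic and the tail.
        if (footerLength <= 0 || footerLength > size - tailSize - MAGIC.length) {
            throw new IOException("Implausible footer length: " + footerLength);
        }
        return footerLength;
    }
}
```

The footer itself is a Flatbuffers message listing each block's offset,
metadata length and body length. The row count lives in each record
batch's metadata rather than in the footer, so getting the total would
still mean decoding those (small) metadata messages, which is not shown
here.
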
On Fri, Sep 21, 2018 at 10:32 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> Hi Michael,
>
> I think ArrowFileReader takes a SeekableByteChannel, so it's possible to
> read only the metadata for each record batch and skip the data. However,
> this is not implemented.
>
> If the input channel is not seekable (for example, a socket channel), then
> you would need to read the body of each record batch to get to the next
> batch, so my hunch is that the performance will be similar whether you read
> the record batch body into a VectorSchemaRoot or just read the bytes.
>
> If you don't assume your input data is always going to be seekable, I am
> not sure there is a quicker way to do this.
>
>
>
> On Fri, Sep 21, 2018 at 9:33 AM Michael Knopf <mkn...@rapidminer.com> wrote:
>
> > Hi all,
> >
> > I am looking for a quick way to look up the total row count of a data set
> > stored in Arrow’s random access file format using the Java API. Basically,
> > a quicker way to do this:
> >
> > // The reader is an instance of ArrowFileReader
> > List<ArrowBlock> blocks = reader.getRecordBlocks();
> > VectorSchemaRoot root = reader.getVectorSchemaRoot();
> > int nRows = 0;
> > for (ArrowBlock block : blocks) {
> >     reader.loadRecordBatch(block);
> >     nRows += root.getRowCount();
> > }
> >
> > My understanding is that the above snippet loads the entire data set
> > instead of just the block headers.
> >
> > To give you some context, I am looking into using Arrow for IPC between a
> > JVM and a Python interpreter using a custom data format and PyArrow/Pandas
> > respectively. While the streaming API might be a better tool for this job,
> > I started out with using files to keep things simple.
> >
> > Any help would be greatly appreciated – maybe I just missed the right bit
> > of documentation.
> >
> > Thanks,
> > Michael
