martin-traverse commented on issue #794:
URL: https://github.com/apache/arrow-java/issues/794#issuecomment-3047867486
Hm - I have found ArrowReader / ArrowWriter to be a bit opinionated. This is
particularly an issue with the reader, because you have less control over when
bytes are going to arrive, but the writer also makes a lot of assumptions, e.g.
dictionaries are static, no deltas etc. Neither gives you much control over
what is happening at the file level. I ended up overriding and shading some of
the supporting classes in order to use them (I have patches to submit which
could reduce the need for this).
I think the goal with ArrowReader / ArrowWriter was to abstract away the
different formats (file and stream), so perhaps that abstraction is partly what
causes the loss of control. My preference would be for a more explicit API that
maps directly to the file structure; it is always possible to layer
generalisation on top for a specific pattern.
Here is a quick summary of what is the same / different:
Writer:
// These can be the same as ArrowWriter
void writeBatch();
long bytesWritten();
void close();
// These are different
void writeHeader(); // Explicit control instead of start() / end(),
                    // which may or may not trigger writes
void resetBatch(VectorSchemaRoot batch); // Allow streaming pattern
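
To make the intended call sequence concrete, here is a rough sketch of how the writer methods above might fit together. It is only an illustration under stated assumptions: `ToyExplicitWriter` is a hypothetical name, a `String` stands in for `VectorSchemaRoot`, and the layout (a 4-byte marker followed by length-prefixed batches) is a toy format, not Arrow IPC.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Arrow code: a String stands in for
// VectorSchemaRoot and the byte layout is a made-up toy format.
class ToyExplicitWriter implements AutoCloseable {
    static final int MAGIC = 0x544F5931; // arbitrary marker for the toy format

    private final ByteArrayOutputStream sink;
    private String batch;
    private long bytesWritten;
    private boolean headerWritten;
    private boolean closed;

    ToyExplicitWriter(ByteArrayOutputStream sink, String firstBatch) {
        this.sink = sink;
        this.batch = firstBatch;
    }

    // Explicit header write: nothing reaches the sink until this is called.
    void writeHeader() {
        ensureOpen();
        if (headerWritten) {
            throw new IllegalStateException("header already written");
        }
        writeInt(MAGIC);
        headerWritten = true;
    }

    // Writes the current batch as a 4-byte length followed by the payload.
    void writeBatch() {
        ensureOpen();
        if (!headerWritten) {
            throw new IllegalStateException("call writeHeader() first");
        }
        byte[] payload = batch.getBytes(StandardCharsets.UTF_8);
        writeInt(payload.length);
        sink.write(payload, 0, payload.length);
        bytesWritten += payload.length;
    }

    // Streaming pattern: swap in the next batch without recreating the writer.
    void resetBatch(String next) {
        ensureOpen();
        this.batch = next;
    }

    long bytesWritten() {
        return bytesWritten;
    }

    @Override
    public void close() {
        closed = true;
    }

    private void writeInt(int value) {
        sink.write(ByteBuffer.allocate(4).putInt(value).array(), 0, 4);
        bytesWritten += 4;
    }

    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
    }
}
```

The point of the sketch is that every write is triggered by an explicit call, so the caller always knows when bytes hit the sink: `writeHeader()`, then `writeBatch()`, then `resetBatch(next)` and `writeBatch()` again for each further batch.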
Reader:
// These are the same as ArrowReader:
VectorSchemaRoot getVectorSchemaRoot();
long bytesRead();
void close();
// These are also the same, from DictionaryProvider
Set<Long> getDictionaryIds();
Dictionary lookup(long id);
// These are different - Arrow has readSchema() and initialize(), which
// may or may not trigger reads
void readHeader(); // Explicit read control
Schema getSchema(); // Explicit get - does not trigger reading
// These are also different - explicit control over reading
boolean readBatch();
boolean hasNextBatch();
long nextBatchPosition();
long nextBatchSize();
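
As a rough illustration of the explicit read control described above, the sketch below reads a toy format (a 4-byte marker, then length-prefixed batches) from an in-memory buffer. Everything in it is an assumption for illustration: `ToyExplicitReader` and `currentBatch()` are hypothetical names, a `String` stands in for the `VectorSchemaRoot` contents, and the schema and dictionary methods are left out for brevity.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Arrow code: reads a made-up toy format and
// uses a String in place of the VectorSchemaRoot contents.
class ToyExplicitReader implements AutoCloseable {
    static final int MAGIC = 0x544F5931; // arbitrary marker for the toy format

    private final ByteBuffer in;
    private boolean headerRead;
    private String batch;

    ToyExplicitReader(byte[] data) {
        this.in = ByteBuffer.wrap(data);
    }

    // Explicit read control: the header is only consumed when asked for.
    void readHeader() {
        if (in.remaining() < 4 || in.getInt() != MAGIC) {
            throw new IllegalStateException("bad or missing header");
        }
        headerRead = true;
    }

    // True if a length prefix for another batch is available.
    boolean hasNextBatch() {
        requireHeader();
        return in.remaining() >= 4;
    }

    // Offset of the next batch; does not consume any bytes.
    long nextBatchPosition() {
        requireHeader();
        return in.position();
    }

    // Payload size of the next batch, peeked without consuming it.
    long nextBatchSize() {
        if (!hasNextBatch()) {
            throw new IllegalStateException("no more batches");
        }
        return in.getInt(in.position());
    }

    // Reads the next batch into the current slot; false when none remain.
    boolean readBatch() {
        if (!hasNextBatch()) {
            return false;
        }
        byte[] payload = new byte[in.getInt()];
        in.get(payload);
        batch = new String(payload, StandardCharsets.UTF_8);
        return true;
    }

    // Stand-in for getVectorSchemaRoot(): returns the last batch read.
    String currentBatch() {
        return batch;
    }

    long bytesRead() {
        return in.position();
    }

    @Override
    public void close() {
        // Nothing to release for an in-memory buffer.
    }

    private void requireHeader() {
        if (!headerRead) {
            throw new IllegalStateException("call readHeader() first");
        }
    }
}
```

A caller would typically run `readHeader()` once and then loop with `while (reader.readBatch()) { ... }`, using `nextBatchPosition()` and `nextBatchSize()` beforehand when it needs to know whether enough bytes have arrived.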
One minor detail on naming - ArrowReader has `loadNextBatch()` which is
possibly more descriptive, but I think there is value in sticking with just one
scheme (load / save, read / write, get / put etc.), otherwise it gets confusing.
Happy to take a steer if you feel differently on any of this. Otherwise if
you are happy lmk and I'll make a start :)