This has probably changed with the Java code refactor, but I've posted some answers inline, to the best of my understanding.

Thanks,

Emilio

On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
Thanks Wes for your help.

Based upon some code reading, I managed to code-up a basic working example.
The code is here:
https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
.

However, I do have some questions about the concepts in Arrow:

1. ArrowBlock is the unit of reading/writing. One ArrowBlock essentially is
the amount of the data one must hold in-memory at a time. Is my
understanding correct?
Yes: each ArrowBlock records the location of one record batch in the file, and a record batch is the unit you read/write and hold in memory at a time.

2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor
classes in the ValueVector interface - both are implemented by all
supported data types. What is the relationship between these two? Or when
is one supposed to use one over the other? I only use Mutator/Accessor
classes in my code.
The write/reader interfaces are parallel implementations that make some things easier, but don't encompass all available functionality (for example, fixed size lists, nested lists, some dictionary operations, etc). However, you should be able to accomplish everything using mutators/accessors.

3. What are the "safe" varient functions in the Mutator's code? I could not
understand what they meant to achieve.
The safe methods ensure that the vector is large enough to set the value. You can use the unsafe versions if you know that your vector has already allocated enough space for your data.
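To illustrate the difference, here is a minimal sketch against the pre-refactor (0.7-style) API used elsewhere in this thread; the class name SafeSetSketch and the values written are just for illustration:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableIntVector;

public class SafeSetSketch {
    static int demo() {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableIntVector vector = new NullableIntVector("ints", allocator)) {
            NullableIntVector.Mutator mutator = vector.getMutator();
            // setSafe() grows the underlying buffers on demand, so no
            // explicit allocation is needed beforehand ...
            for (int i = 0; i < 1000; i++) {
                mutator.setSafe(i, i * 2);
            }
            mutator.setValueCount(1000);
            // ... whereas set() assumes you already reserved capacity:
            //   vector.allocateNew(1000); mutator.set(i, value);
            return vector.getAccessor().get(7);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```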
4. What are MinorTypes?
Minor types are a representation of the different vector types. I believe they are being de-emphasized in favor of FieldTypes, as minor types don't contain enough information to represent all vectors.

5. For a writer, what is a dictionary provider? For example in the
Integration.java code, the reader is given as the dictionary provider for
the writer. But, is it something more than just:
DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
The dictionary provider is an interface for looking up dictionary values. When reading a file, the reader itself has already read the dictionaries and thus serves as the provider.
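In other words, if none of your columns are dictionary-encoded, the empty MapDictionaryProvider in your snippet should be all the writer needs; you only have to put() dictionaries into it when you actually encode columns. A sketch under that assumption (same pre-refactor API; the helper and variable names are illustrative):

```java
import java.io.FileOutputStream;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.file.ArrowFileWriter;

public class WriterProviderSketch {
    // root is assumed to be a populated VectorSchemaRoot
    static ArrowFileWriter makeWriter(VectorSchemaRoot root,
                                      FileOutputStream fileOutputStream) {
        DictionaryProvider.MapDictionaryProvider provider =
                new DictionaryProvider.MapDictionaryProvider();
        // If a column were dictionary-encoded, you would register its
        // dictionary here first, e.g.:
        //   provider.put(new Dictionary(dictValuesVector, encoding));
        return new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
    }
}
```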
6. I am not entirely sure about the sequence of calls one needs to make to
write through mutators. For example, if I code something like
NullableIntVector intVector = (NullableIntVector) fieldVector;
NullableIntVector.Mutator mutator = intVector.getMutator();
[.write num values]
mutator.setValueCount(num)
then this works for primitive types, but not for the VarBinary type; there
I have to set the capacity first:

NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) fieldVector;
varBinaryVector.setInitialCapacity(items);
varBinaryVector.allocateNew();
NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
The method calls are not very well documented - I would suggest looking at the reader/writer implementations to see which calls are required for which vector types. Generally, variable-length vectors (lists, var binary, etc.) behave differently than fixed-width vectors (ints, longs, etc.).
Example of these are here:
https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
(writeField[???] functions).
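For reference, a self-contained sketch of the variable-width sequence (same caveat: this is the pre-refactor API, and the class name and payloads are illustrative):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableVarBinaryVector;

public class VarBinaryWriteSketch {
    static String demo() {
        int items = 100;
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableVarBinaryVector vector =
                     new NullableVarBinaryVector("bytes", allocator)) {
            // variable-width vectors need their offset and data buffers
            // allocated before values are written through the mutator
            vector.setInitialCapacity(items);
            vector.allocateNew();
            NullableVarBinaryVector.Mutator mutator = vector.getMutator();
            for (int i = 0; i < items; i++) {
                byte[] payload = ("row-" + i).getBytes();
                mutator.setSafe(i, payload, 0, payload.length);
            }
            mutator.setValueCount(items);
            return new String(vector.getAccessor().get(3));
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```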

Thank you very much,
--
Animesh



On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:

hi Animesh,

I suggest you try the ArrowStreamReader/Writer or
ArrowFileReader/Writer classes. See
https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
for working example code.

- Wes

On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
<animesh.triv...@gmail.com> wrote:
Hi all,

It might be a trivial question, so please let me know if I am missing
something.

I am trying to write and read files in the Arrow format in Java. My data
has a simple flat schema with primitive types. I already have the data in
Java.
So my questions are:
1. Is this possible, or am I fundamentally missing something about what
Arrow can or cannot do (or is designed to do)? I assume that an efficient
in-memory columnar data format should work with files too.
2. Can you point me to a working example, or at least a starting example?
Intuitively I am looking for a way to define schema, write/read column
vectors to/from files as one does with Parquet or ORC.

I tried to locate some working examples with the ArrowFile[Reader/Writer]
classes in the Maven tests, but so far I am not sure where to start.

Thanks,
--
Animesh
