This has probably changed with the Java code refactor, but I've posted some answers inline, to the best of my understanding.

Thanks,

Emilio

On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
Thanks Wes for your help.

Based upon some code reading, I managed to code-up a basic working example.
The code is here:
https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
.

However, I do have some questions about the concepts in Arrow:

1. ArrowBlock is the unit of reading/writing. One ArrowBlock essentially is
the amount of the data one must hold in-memory at a time. Is my
understanding correct?
Yes: each ArrowBlock records the location of one record batch in the file, and a record batch is the unit you read/write and hold in memory at a time.

2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor
classes in the ValueVector interface - both are implemented by all
supported data types. What is the relationship between these two? Or when
is one supposed to use one over the other? I only use Mutator/Accessor
classes in my code.
The write/reader interfaces are parallel implementations that make some things easier, but don't encompass all available functionality (for example, fixed size lists, nested lists, some dictionary operations, etc). However, you should be able to accomplish everything using mutators/accessors.

3. What are the "safe" varient functions in the Mutator's code? I could not
understand what they meant to achieve.
The safe methods ensure that the vector is large enough to set the value. You can use the unsafe versions if you know that your vector has already allocated enough space for your data.
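To illustrate the difference, here is a minimal sketch against the pre-refactor (0.7-style) API used elsewhere in this thread; the class name SafeSetSketch and the values written are just for illustration:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableIntVector;

public class SafeSetSketch {
    static int demo() {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableIntVector vector = new NullableIntVector("ints", allocator)) {
            NullableIntVector.Mutator mutator = vector.getMutator();
            // setSafe() grows the underlying buffers on demand, so no
            // explicit allocation is needed beforehand ...
            for (int i = 0; i < 1000; i++) {
                mutator.setSafe(i, i * 2);
            }
            mutator.setValueCount(1000);
            // ... whereas set() assumes you already reserved capacity:
            //   vector.allocateNew(1000); mutator.set(i, value);
            return vector.getAccessor().get(7);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```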
4. What are MinorTypes?
Minor types are a representation of the different vector types. I believe they are being de-emphasized in favor of FieldTypes, as minor types don't contain enough information to represent all vectors.

5. For a writer, what is a dictionary provider? For example in the
Integration.java code, the reader is given as the dictionary provider for
the writer. But, is it something more than just:
DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
The dictionary provider is an interface for looking up dictionary values. When reading a file, the reader itself has already read the dictionaries and thus serves as the provider.
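In other words, if none of your columns are dictionary-encoded, the empty MapDictionaryProvider in your snippet should be all the writer needs; you only have to put() dictionaries into it when you actually encode columns. A sketch under that assumption (same pre-refactor API; the helper and variable names are illustrative):

```java
import java.io.FileOutputStream;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.file.ArrowFileWriter;

public class WriterProviderSketch {
    // root is assumed to be a populated VectorSchemaRoot
    static ArrowFileWriter makeWriter(VectorSchemaRoot root,
                                      FileOutputStream fileOutputStream) {
        DictionaryProvider.MapDictionaryProvider provider =
                new DictionaryProvider.MapDictionaryProvider();
        // If a column were dictionary-encoded, you would register its
        // dictionary here first, e.g.:
        //   provider.put(new Dictionary(dictValuesVector, encoding));
        return new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
    }
}
```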
6. I am not entirely sure about the sequence of calls one needs to make to
write through mutators. For example, if I code something like
NullableIntVector intVector = (NullableIntVector) fieldVector;
NullableIntVector.Mutator mutator = intVector.getMutator();
[.write num values]
mutator.setValueCount(num)
then this works for primitive types, but not for the VarBinary type; there
I have to set the capacity first:

NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) fieldVector;
varBinaryVector.setInitialCapacity(items);
varBinaryVector.allocateNew();
NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
The method calls are not very well documented - I would suggest looking at the reader/writer implementations to see which calls are required for which vector types. Generally, variable-length vectors (lists, var binary, etc.) behave differently than fixed-width vectors (ints, longs, etc.).
Example of these are here:
https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
(writeField[???] functions).
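For reference, a self-contained sketch of the variable-width sequence (same caveat: this is the pre-refactor API, and the class name and payloads are illustrative):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableVarBinaryVector;

public class VarBinaryWriteSketch {
    static String demo() {
        int items = 100;
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableVarBinaryVector vector =
                     new NullableVarBinaryVector("bytes", allocator)) {
            // variable-width vectors need their offset and data buffers
            // allocated before values are written through the mutator
            vector.setInitialCapacity(items);
            vector.allocateNew();
            NullableVarBinaryVector.Mutator mutator = vector.getMutator();
            for (int i = 0; i < items; i++) {
                byte[] payload = ("row-" + i).getBytes();
                mutator.setSafe(i, payload, 0, payload.length);
            }
            mutator.setValueCount(items);
            return new String(vector.getAccessor().get(3));
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```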

Thank you very much,
--
Animesh



On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:

hi Animesh,

I suggest you try the ArrowStreamReader/Writer or
ArrowFileReader/Writer classes. See
https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
for working example code.

- Wes

On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
<animesh.triv...@gmail.com> wrote:
Hi all,

It might be a trivial question, so please let me know if I am missing
something.

I am trying to write and read files in the Arrow format in Java. My data
has a simple flat schema with primitive types. I already have the data in
Java.
So my questions are:
1. Is this possible, or am I fundamentally missing something about what
Arrow can or cannot do (or is designed to do)? I assume that an efficient
in-memory columnar data format should work with files too.
2. Can you point me to a working example, or at least a starting example?
Intuitively I am looking for a way to define schema, write/read column
vectors to/from files as one does with Parquet or ORC.

I tried to locate some working examples with the ArrowFile[Reader/Writer]
classes in the Maven tests, but so far I am not sure where to start.

Thanks,
--
Animesh
