[ 
https://issues.apache.org/jira/browse/ARROW-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225520#comment-16225520
 ] 

ASF GitHub Bot commented on ARROW-1047:
---------------------------------------

BryanCutler commented on a change in pull request #1259: ARROW-1047: [Java] Add 
Generic Reader Interface for Stream Format
URL: https://github.com/apache/arrow/pull/1259#discussion_r147797804
 
 

 ##########
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowReader.java
 ##########
 @@ -216,9 +171,32 @@ private void initialize() throws IOException {
     this.root = new VectorSchemaRoot(schema, vectors, 0);
     this.loader = new VectorLoader(root);
     this.dictionaries = Collections.unmodifiableMap(dictionaries);
+
+    // Read and load all dictionaries from schema
+    for (int i = 0; i < dictionaries.size(); i++) {
 
 Review comment:
   Yeah, we could still do that. I think it just comes down to whether we read 
the dictionaries right after the schema or just before the first data batch. 
I thought it made a little more sense to read them with the schema; otherwise 
the user could create the reader, load the schema, and try to decode with it, 
only to fail because the dictionaries have not been read yet.
   
   Would it work for you if we overloaded `ArrowReader.readSchema` so that it 
can return the original schema before the dictionaries are loaded? 
Similarly, if you are using the stream format, you could subclass 
`MessageReader` (introduced here) and react after a schema message is read. If 
not, I'm OK with reading them before the data batches and documenting for the 
user that nothing can be decoded until batches are read.
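
   To make the ordering discussed above concrete, here is a rough sketch of a 
reader that loads dictionaries right after the schema while still giving a 
subclass a chance to see the original schema first. Everything in it 
(`SketchReader`, `SketchMessage`, `MessageKind`, `onSchemaRead`) is a 
hypothetical stand-in written for illustration; it is not the `ArrowReader` or 
`MessageReader` API from this pull request.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-ins for stream messages; NOT the real Arrow classes.
enum MessageKind { SCHEMA, DICTIONARY, RECORD_BATCH }

final class SketchMessage {
  final MessageKind kind;
  final String payload;

  SketchMessage(MessageKind kind, String payload) {
    this.kind = kind;
    this.payload = payload;
  }
}

// Assumed design: dictionaries are loaded eagerly, together with the schema.
class SketchReader {
  private final Deque<SketchMessage> messages;
  private final List<String> dictionaries = new ArrayList<>();
  private String schema;

  SketchReader(List<SketchMessage> input) {
    this.messages = new ArrayDeque<>(input);
  }

  // Hook for subclasses: called once the schema message has been read,
  // before any dictionary has been loaded.
  protected void onSchemaRead(String originalSchema) {}

  String readSchema() {
    if (schema == null) {
      SketchMessage first = messages.poll();
      if (first == null || first.kind != MessageKind.SCHEMA) {
        throw new IllegalStateException("expected a schema message first");
      }
      schema = first.payload;
      onSchemaRead(schema);
      // Load every dictionary that directly follows the schema, so that
      // decoding is possible as soon as readSchema() returns.
      while (!messages.isEmpty()
          && messages.peek().kind == MessageKind.DICTIONARY) {
        dictionaries.add(messages.poll().payload);
      }
    }
    return schema;
  }

  int loadedDictionaryCount() {
    return dictionaries.size();
  }

  // Returns the next batch payload, or null when the stream is exhausted.
  String readNextBatch() {
    readSchema();
    SketchMessage next = messages.poll();
    return next == null ? null : next.payload;
  }
}
```

   The alternative (reading dictionaries lazily before the first batch) would 
move the `while` loop into `readNextBatch`, at the cost of `readSchema` 
returning before decoding is possible.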

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Java] Add generalized stream writer and reader interfaces that are decoupled 
> from IO / message framing
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1047
>                 URL: https://issues.apache.org/jira/browse/ARROW-1047
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java - Vectors
>            Reporter: Wes McKinney
>            Assignee: Bryan Cutler
>              Labels: pull-request-available
>
> cc [~julienledem] [~elahrvivaz] [~nongli]
> The ArrowWriter 
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java
>  accepts a WritableByteChannel to which the stream is written.
> It would be useful to be able to support other kinds of message framing and 
> transport, like GRPC or HTTP. So rather than writing a complete Arrow stream 
> as a single contiguous byte stream, the component messages (schema, 
> dictionaries, and record batches) would be framed as separate messages in the 
> underlying protocol. 
> So if we were using ProtocolBuffers and gRPC as the underlying transport for 
> the stream, we could encapsulate components of an Arrow stream in objects 
> like:
> {code:language=protobuf}
> message ArrowMessagePB {
>   required bytes serialized_data = 1;
> }
> {code}
> If the transport supports zero copy, that is obviously better than 
> serializing then parsing a protocol buffer.
> We should do this work in C++ as well to support more flexible stream 
> transport. 
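
One way to picture the decoupling described in the issue is a writer that 
hands each serialized message to a pluggable sink, so the same writer can 
produce either one contiguous byte stream or separately framed messages. The 
sketch below is only an illustration under that assumption: `MessageSink`, 
`ChannelSink`, `FramedSink`, and `SketchStreamWriter` are hypothetical names, 
and the 4-byte length prefix is a simplification, not Arrow's actual stream 
framing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction: where one serialized message (schema,
// dictionary, or record batch) goes is decided by the sink.
interface MessageSink {
  void send(byte[] serializedMessage) throws IOException;
}

// Writes all messages into one contiguous, length-prefixed byte stream.
final class ChannelSink implements MessageSink {
  private final WritableByteChannel channel;

  ChannelSink(WritableByteChannel channel) {
    this.channel = channel;
  }

  @Override
  public void send(byte[] serializedMessage) throws IOException {
    ByteBuffer frame = ByteBuffer.allocate(4 + serializedMessage.length);
    frame.putInt(serializedMessage.length).put(serializedMessage);
    frame.flip();
    // A channel may accept fewer bytes than offered, so loop until done.
    while (frame.hasRemaining()) {
      channel.write(frame);
    }
  }
}

// Keeps each message as its own frame, standing in for a transport such as
// gRPC or HTTP that already delimits messages.
final class FramedSink implements MessageSink {
  final List<byte[]> frames = new ArrayList<>();

  @Override
  public void send(byte[] serializedMessage) {
    frames.add(serializedMessage.clone());
  }
}

// The writer only knows about the sink, not about IO or framing.
final class SketchStreamWriter {
  private final MessageSink sink;

  SketchStreamWriter(MessageSink sink) {
    this.sink = sink;
  }

  void writeAll(List<byte[]> serializedMessages) throws IOException {
    for (byte[] message : serializedMessages) {
      sink.send(message);
    }
  }
}
```

With this shape, supporting a new transport would mean adding another 
`MessageSink` implementation rather than changing the writer.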



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
