I opened https://issues.apache.org/jira/browse/ARROW-1343. Let's try to resolve this today so it can make the 0.6.0 release?
On Tue, Aug 8, 2017 at 2:03 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> I'm happy to have a look at the branch / integration tests if you
> could put up a PR.
>
> When you say "a single serialized record batch" you mean an
> encapsulated message, right (including length prefix and metadata)?
> Using the terminology from http://arrow.apache.org/docs/ipc.html. I
> guess the problem is that the total size of the Schema message at the
> start of the stream may not be a multiple of 8. We should totally fix
> this; I don't think it even constitutes a breakage of the format -- I
> am fairly sure with extra padding bytes between the schema and the
> first record batch (or first dictionary) that the stream will be
> backwards compatible.
>
> However, we should document in
> https://github.com/apache/arrow/blob/master/format/IPC.md that message
> sizes are expected to be a multiple of 8. We should also take a look
> at the File format implementation to ensure that padding is inserted
> after the magic number at the start of the file.
>
> - Wes
>
> On Tue, Aug 8, 2017 at 1:32 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>> Sure, the workflow is a little complicated, but we have the following code
>> running in distributed databases (as accumulo iterators and hbase
>> coprocessors).
>> They process data rows and transform them into arrow records,
>> then periodically write out record batches:
>>
>> https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L79-L105
>>
>> https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/io/records/RecordBatchUnloader.scala#L20
>>
>> The record batches come back to a single client and are concatenated with a
>> file header and footer (then wrapped in SimpleFeature objects, as we
>> implement a geotools data store):
>>
>> https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L265-L268
>>
>> The resulting bytes are written out as an arrow streaming file that we parse
>> with the arrow-js libraries in the browser.
>>
>> Thanks,
>>
>> Emilio
>>
>> On 08/08/2017 01:24 PM, Li Jin wrote:
>>> Hi Emilio,
>>>
>>>> So I think the issue is that we are serializing record batches in a
>>>> distributed fashion, and then concatenating them in the streaming format.
>>>
>>> Can you show the code for this?
>>>
>>> On Tue, Aug 8, 2017 at 12:35 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>>>> So I think the issue is that we are serializing record batches in a
>>>> distributed fashion, and then concatenating them in the streaming format.
>>>> However, the message serialization only aligns the start of the buffers,
>>>> which requires it to know the current absolute offset of the output stream.
>>>> Would there be any problem with padding the end of the message, so that any
>>>> single serialized record batch would always be a multiple of 8 bytes?
>>>>
>>>> I've put together a branch that does this, and the existing java tests all
>>>> pass. I'm having some trouble running the integration tests though.
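Emilio's proposal above -- padding the tail of each serialized encapsulated message so its total length is a multiple of 8 bytes -- can be sketched roughly as follows. This is a hypothetical illustration in TypeScript (the `padTo8` helper is invented here, not code from the branch under discussion):

```typescript
// Hypothetical sketch: append zero bytes to a serialized message so that
// its total length is a multiple of 8, making concatenated batches safe
// to view with 8-byte-aligned TypedArrays regardless of stream position.
function padTo8(message: Uint8Array): Uint8Array {
  const paddedLength = (message.length + 7) & ~7; // round up to a multiple of 8
  if (paddedLength === message.length) {
    return message; // already aligned, nothing to do
  }
  const padded = new Uint8Array(paddedLength); // zero-filled by default
  padded.set(message); // copy original bytes; trailing bytes remain 0
  return padded;
}
```

Because the padding is appended per message, each producer can align its own output without knowing the absolute offset of the final concatenated stream.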
>>>> Thanks,
>>>>
>>>> Emilio
>>>>
>>>> On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote:
>>>>> Hi Wes,
>>>>>
>>>>> You're right, I just realized that. I think the alignment issue might be
>>>>> in some unrelated code, actually. From what I can tell, the arrow
>>>>> writers are aligning buffers correctly; if not, I'll open a bug.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Emilio
>>>>>
>>>>> On 08/08/2017 09:15 AM, Wes McKinney wrote:
>>>>>> hi Emilio,
>>>>>>
>>>>>> From your description, it isn't clear why 8-byte alignment is causing
>>>>>> a problem (as compared with 64-byte alignment). My understanding is
>>>>>> that JavaScript's TypedArray classes range in element size from 1 to 8 bytes.
>>>>>>
>>>>>> The starting offset for all buffers should be 8-byte aligned; if not,
>>>>>> that is a bug. Could you clarify?
>>>>>>
>>>>>> - Wes
>>>>>>
>>>>>> On Tue, Aug 8, 2017 at 8:52 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>>>>>>> After looking at it further, I think only the buffers themselves need
>>>>>>> to be aligned, not the metadata and/or schema. Would there be any
>>>>>>> problem with changing the alignment to 64 bytes then?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Emilio
>>>>>>>
>>>>>>> On 08/08/2017 08:08 AM, Emilio Lahr-Vivaz wrote:
>>>>>>>> I'm looking into buffer alignment in the java writer classes.
>>>>>>>> Currently some files written with the java streaming writer can't be
>>>>>>>> read due to the javascript TypedArray's restriction that the start
>>>>>>>> offset of the array must be a multiple of the element size of the
>>>>>>>> array type (i.e. Int32Vectors must start on a multiple of 4,
>>>>>>>> Float64Vectors must start on a multiple of 8, etc).
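The TypedArray restriction Emilio describes can be reproduced directly in Node or a browser. This illustrative sketch (not code from the thread) shows a `Float64Array` view succeeding at an 8-byte-aligned offset and throwing `RangeError` at a misaligned one:

```typescript
// A TypedArray view's byteOffset must be a multiple of its element size
// (BYTES_PER_ELEMENT), or the constructor throws a RangeError. This is the
// rule that breaks reading misaligned Arrow buffers in JavaScript.
const buffer = new ArrayBuffer(32);

// byteOffset 8 is a multiple of Float64Array.BYTES_PER_ELEMENT (8): fine.
const aligned = new Float64Array(buffer, 8, 2);

// byteOffset 4 is not a multiple of 8: construction throws.
let threwRangeError = false;
try {
  new Float64Array(buffer, 4, 2);
} catch (e) {
  threwRangeError = e instanceof RangeError;
}
```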
>>>>>>>> From a cursory look at the java writer, I believe that the schema
>>>>>>>> that is written first is not aligned at all, and then each record
>>>>>>>> batch pads out its size to a multiple of 8. So:
>>>>>>>>
>>>>>>>> 1. should the schema block pad itself so that the first record batch
>>>>>>>> is aligned, and is there any problem with doing so?
>>>>>>>> 2. is there any problem with changing the alignment to 64 bytes, as
>>>>>>>> recommended (but not required) by the spec?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Emilio