I'm happy to have a look at the branch / integration tests if you could
put up a PR.

When you say "a single serialized record batch" you mean an encapsulated
message, right (including the length prefix and metadata)? Using the
terminology from http://arrow.apache.org/docs/ipc.html. I guess the
problem is that the total size of the Schema message at the start of the
stream may not be a multiple of 8.

We should totally fix this; I don't think it even constitutes a breakage
of the format -- I am fairly sure that with extra padding bytes between
the schema and the first record batch (or first dictionary) the stream
will be backwards compatible. However, we should document in
https://github.com/apache/arrow/blob/master/format/IPC.md that message
sizes are expected to be a multiple of 8. We should also take a look at
the File format implementation to ensure that padding is inserted after
the magic number at the start of the file.

- Wes
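For illustration, a minimal sketch of the padding arithmetic under
discussion, in TypeScript (the function name is hypothetical, not part
of any Arrow library): each encapsulated message is rounded up to the
next multiple of 8 with trailing zero bytes, so whatever follows it
starts 8-byte aligned.

    // Sketch only: pad a serialized encapsulated message (length prefix +
    // metadata + body) out to the next multiple of 8 bytes.
    function padTo8(message: Uint8Array): Uint8Array {
      const paddedLength = (message.length + 7) & ~7; // next multiple of 8
      if (paddedLength === message.length) {
        return message; // already aligned
      }
      const padded = new Uint8Array(paddedLength); // zero-filled by default
      padded.set(message);
      return padded;
    }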
When you say "a single serialized record batch" you mean an encapsulated message, right (including length prefix and metadata)? Using the terminology from http://arrow.apache.org/docs/ipc.html. I guess the problem is that the total size of the Schema message at the start of the stream may not be a multiple of 8. We should totally fix this; I don't think it even constitutes a breakage of the format -- I am fairly sure with extra padding bytes between the schema and the first record batch (or first dictionary) that the stream will be backwards compatible. However, we should document in https://github.com/apache/arrow/blob/master/format/IPC.md that message sizes are expected to be a multiple of 8. We should also take a look at the File format implementation to ensure that padding is inserted after the magic number at the start of the file - Wes On Tue, Aug 8, 2017 at 1:32 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote: > Sure, the workflow is a little complicated, but we have the following code > running in distributed databases (as accumulo iterators and hbase > coprocessors). They process data rows and transform them into arrow records, > then periodically write out record batches: > > https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L79-L105 > > https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/io/records/RecordBatchUnloader.scala#L20 > > The record batches come back to a single client and are concatenated with a > file header and footer (then wrapped in SimpleFeature objects, as we > implement a geotools data store): > > https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L265-L268 > > The resulting bytes are written out as an arrow streaming file that we parse > with the arrow-js libraries in the browser. > > Thanks, > > Emilio > > > On 08/08/2017 01:24 PM, Li Jin wrote: >> >> Hi Emilio, >> >>> So I think the issue is that we are serializing record batches in a >> >> distributed fashion, and then > concatenating them in the streaming >> format. >> >> Can you show the code for this? >> >> On Tue, Aug 8, 2017 at 12:35 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> >> wrote: >> >>> So I think the issue is that we are serializing record batches in a >>> distributed fashion, and then concatenating them in the streaming format. >>> However, the message serialization only aligns the start of the buffers, >>> which requires it to know the current absolute offset of the output >>> stream. >>> Would there be any problem with padding the end of the message, so any >>> single serialized record batch would always be a multiple of 8 bytes? >>> >>> I've put together a branch that does this, and the existing java tests >>> all >>> pass. I'm having some trouble running the integration tests though. >>> >>> Thanks, >>> >>> Emilio >>> >>> >>> On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote: >>> >>>> Hi Wes, >>>> >>>> You're right, I just realized that. I think the alignment issue might be >>>> in some unrelated code, actually. From what I can tell the the arrow >>>> writers are aligning buffers correctly; if not I'll open a bug. 
>>> On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote:
>>>> Hi Wes,
>>>>
>>>> You're right, I just realized that. I think the alignment issue
>>>> might be in some unrelated code, actually. From what I can tell, the
>>>> arrow writers are aligning buffers correctly; if not, I'll open a
>>>> bug.
>>>>
>>>> Thanks,
>>>>
>>>> Emilio
>>>>
>>>> On 08/08/2017 09:15 AM, Wes McKinney wrote:
>>>>> hi Emilio,
>>>>>
>>>>> From your description, it isn't clear why 8-byte alignment is
>>>>> causing a problem (as compared with 64-byte alignment). My
>>>>> understanding is that JavaScript's TypedArray classes range in size
>>>>> from 1 to 8 bytes.
>>>>>
>>>>> The starting offset for all buffers should be 8-byte aligned; if
>>>>> not, that is a bug. Could you clarify?
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Tue, Aug 8, 2017 at 8:52 AM, Emilio Lahr-Vivaz
>>>>> <elahrvi...@ccri.com> wrote:
>>>>>> After looking at it further, I think only the buffers themselves
>>>>>> need to be aligned, not the metadata and/or schema. Would there be
>>>>>> any problem with changing the alignment to 64 bytes, then?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Emilio
>>>>>>
>>>>>> On 08/08/2017 08:08 AM, Emilio Lahr-Vivaz wrote:
>>>>>>> I'm looking into buffer alignment in the java writer classes.
>>>>>>> Currently some files written with the java streaming writer can't
>>>>>>> be read due to the javascript TypedArray restriction that the
>>>>>>> start offset of the array must be a multiple of the data size of
>>>>>>> the array type (i.e. Int32Vectors must start on a multiple of 4,
>>>>>>> Float64Vectors must start on a multiple of 8, etc.). From a
>>>>>>> cursory look at the java writer, I believe that the schema that
>>>>>>> is written first is not aligned at all, and then each record
>>>>>>> batch pads out its size to a multiple of 8. So:
>>>>>>>
>>>>>>> 1. Should the schema block pad itself so that the first record
>>>>>>> batch is aligned, and is there any problem with doing so?
>>>>>>> 2. Is there any problem with changing the alignment to 64 bytes,
>>>>>>> as recommended (but not required) by the spec?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Emilio
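The reader-side restriction described above is easy to reproduce
(illustrative TypeScript; this is standard TypedArray constructor
behavior, not anything Arrow-specific):

    // A TypedArray view must start at a byte offset that is a multiple of
    // its element size, or the constructor throws a RangeError.
    const buffer = new ArrayBuffer(32);
    new Int32Array(buffer, 4, 2);   // ok: 4 is a multiple of 4
    new Float64Array(buffer, 8, 2); // ok: 8 is a multiple of 8
    new Float64Array(buffer, 4, 2); // RangeError: offset not a multiple of 8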