I'm happy to have a look at the branch / integration tests if you
could put up a PR.

When you say "a single serialized record batch" you mean an
encapsulated message, right (including the length prefix and
metadata)? I'm using the terminology from
http://arrow.apache.org/docs/ipc.html. I guess the problem is that the
total size of the Schema message at the start of the stream may not be
a multiple of 8. We should totally fix this; I don't think it even
constitutes a breakage of the format -- I am fairly sure that, with
extra padding bytes between the schema and the first record batch (or
first dictionary batch), the stream will remain backwards compatible.
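
Concretely, on the write side this just means padding the stream out
to the next 8-byte boundary after the schema message. A minimal sketch
in Java (names here are illustrative, not the actual writer
internals):

    import java.io.IOException;
    import java.io.OutputStream;

    // pad the output to the next 8-byte boundary; the zero bytes are
    // simply skipped by readers
    static void align(OutputStream out, long bytesWritten) throws IOException {
        int padding = (int) ((8 - (bytesWritten % 8)) % 8);
        out.write(new byte[padding]);
    }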

However, we should document in
https://github.com/apache/arrow/blob/master/format/IPC.md that message
sizes are expected to be a multiple of 8. We should also take a look
at the File format implementation to ensure that padding is inserted
after the magic number at the start of the file.
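
For reference, the file format's magic number "ARROW1" is 6 bytes, so
two zero bytes after it put the first message on an 8-byte boundary.
Roughly what the file writer should guarantee (an illustrative sketch,
not the actual ArrowFileWriter code):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    // write the 6-byte magic plus 2 padding bytes so that the schema
    // message that follows starts on an 8-byte boundary
    static void writeMagic(OutputStream out) throws IOException {
        out.write(MAGIC);
        out.write(new byte[(8 - MAGIC.length % 8) % 8]); // 2 zero bytes
    }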

- Wes

On Tue, Aug 8, 2017 at 1:32 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
> Sure, the workflow is a little complicated, but we have the following code
> running in distributed databases (as Accumulo iterators and HBase
> coprocessors). They process data rows and transform them into Arrow
> records, then periodically write out record batches:
>
> https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L79-L105
>
> https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/io/records/RecordBatchUnloader.scala#L20
>
> The record batches come back to a single client and are concatenated with a
> file header and footer (then wrapped in SimpleFeature objects, as we
> implement a GeoTools data store):
>
> https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L265-L268
>
> The resulting bytes are written out as an Arrow streaming file that we parse
> with the arrow-js libraries in the browser.
>
> Thanks,
>
> Emilio
>
>
> On 08/08/2017 01:24 PM, Li Jin wrote:
>>
>> Hi Emilio,
>>
>>> So I think the issue is that we are serializing record batches in a
>>> distributed fashion, and then concatenating them in the streaming
>>> format.
>>
>> Can you show the code for this?
>>
>> On Tue, Aug 8, 2017 at 12:35 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
>> wrote:
>>
>>> So I think the issue is that we are serializing record batches in a
>>> distributed fashion, and then concatenating them in the streaming format.
>>> However, the message serialization only aligns the start of the buffers,
>>> which requires it to know the current absolute offset of the output
>>> stream. Would there be any problem with padding the end of the message,
>>> so that any single serialized record batch is always a multiple of 8
>>> bytes long?
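>>>
>>> Something like this (a rough sketch of the idea, not the exact code
>>> from my branch):
>>>
>>>     import java.io.IOException;
>>>     import java.io.OutputStream;
>>>
>>>     // pad each serialized message to a multiple of 8 bytes, so
>>>     // that batches can be concatenated without knowing the
>>>     // absolute offset of the output stream
>>>     static long writePadded(OutputStream out, byte[] message) throws IOException {
>>>         out.write(message);
>>>         int padding = (8 - message.length % 8) % 8;
>>>         out.write(new byte[padding]);
>>>         return message.length + padding; // always a multiple of 8
>>>     }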
>>>
>>> I've put together a branch that does this, and the existing Java tests
>>> all pass. I'm having some trouble running the integration tests, though.
>>>
>>> Thanks,
>>>
>>> Emilio
>>>
>>>
>>> On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote:
>>>
>>>> Hi Wes,
>>>>
>>>> You're right, I just realized that. I think the alignment issue might be
>>>> in some unrelated code, actually. From what I can tell, the Arrow
>>>> writers are aligning buffers correctly; if not, I'll open a bug.
>>>>
>>>> Thanks,
>>>>
>>>> Emilio
>>>>
>>>> On 08/08/2017 09:15 AM, Wes McKinney wrote:
>>>>
>>>>> hi Emilio,
>>>>>
>>>>> From your description, it isn't clear why 8-byte alignment is causing
>>>>> a problem (as compared with 64-byte alignment). My understanding is
>>>>> that the element sizes of JavaScript's TypedArray classes range from
>>>>> 1 to 8 bytes.
>>>>>
>>>>> The starting offset for all buffers should be 8-byte aligned, if not
>>>>> that is a bug. Could you clarify?
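>>>>>
>>>>> (A quick way to check, sketched against the generated flatbuffer
>>>>> metadata classes -- illustrative code, not something currently in
>>>>> the codebase:)
>>>>>
>>>>>     import org.apache.arrow.flatbuf.RecordBatch;
>>>>>
>>>>>     // every buffer's starting offset in the record batch
>>>>>     // metadata should be a multiple of 8
>>>>>     static void checkAlignment(RecordBatch batch) {
>>>>>         for (int i = 0; i < batch.buffersLength(); i++) {
>>>>>             long offset = batch.buffers(i).offset();
>>>>>             if (offset % 8 != 0) {
>>>>>                 throw new IllegalStateException(
>>>>>                     "buffer " + i + " starts at offset " + offset);
>>>>>             }
>>>>>         }
>>>>>     }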
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Tue, Aug 8, 2017 at 8:52 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
>>>>> wrote:
>>>>>
>>>>>> After looking at it further, I think only the buffers themselves need
>>>>>> to be aligned, not the metadata and/or schema. Would there be any
>>>>>> problem with changing the alignment to 64 bytes, then?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Emilio
>>>>>>
>>>>>>
>>>>>> On 08/08/2017 08:08 AM, Emilio Lahr-Vivaz wrote:
>>>>>>
>>>>>>> I'm looking into buffer alignment in the Java writer classes.
>>>>>>> Currently, some files written with the Java streaming writer can't
>>>>>>> be read due to the JavaScript TypedArray restriction that the start
>>>>>>> offset of the array must be a multiple of the element size of the
>>>>>>> array type (i.e. Int32Vectors must start on a multiple of 4,
>>>>>>> Float64Vectors must start on a multiple of 8, etc.). From a cursory
>>>>>>> look at the Java writer, I believe that the schema that is written
>>>>>>> first is not aligned at all, and then each record batch pads out its
>>>>>>> size to a multiple of 8. So:
>>>>>>>
>>>>>>> 1. Should the schema block pad itself so that the first record batch
>>>>>>> is aligned, and is there any problem with doing so?
>>>>>>> 2. Is there any problem with changing the alignment to 64 bytes, as
>>>>>>> recommended (but not required) by the spec?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Emilio
>>>>>>>
>>>>>>
>
