Hi Emilio,

> So I think the issue is that we are serializing record batches in a distributed fashion, and then
> concatenating them in the streaming format.

Can you show the code for this?

On Tue, Aug 8, 2017 at 12:35 PM, Emilio Lahr-Vivaz <[email protected]> wrote:
> So I think the issue is that we are serializing record batches in a
> distributed fashion, and then concatenating them in the streaming format.
> However, the message serialization only aligns the start of the buffers,
> which requires it to know the current absolute offset of the output stream.
> Would there be any problem with padding the end of the message, so any
> single serialized record batch would always be a multiple of 8 bytes?
>
> I've put together a branch that does this, and the existing java tests all
> pass. I'm having some trouble running the integration tests though.
>
> Thanks,
>
> Emilio
>
>
> On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote:
>
>> Hi Wes,
>>
>> You're right, I just realized that. I think the alignment issue might be
>> in some unrelated code, actually. From what I can tell the arrow
>> writers are aligning buffers correctly; if not I'll open a bug.
>>
>> Thanks,
>>
>> Emilio
>>
>> On 08/08/2017 09:15 AM, Wes McKinney wrote:
>>
>>> hi Emilio,
>>>
>>> From your description, it isn't clear why 8-byte alignment is causing
>>> a problem (as compared with 64-byte alignment). My understanding is
>>> that JavaScript's TypedArray classes range in size from 1 to 8 bytes.
>>>
>>> The starting offset for all buffers should be 8-byte aligned, if not
>>> that is a bug. Could you clarify?
>>>
>>> - Wes
>>>
>>> On Tue, Aug 8, 2017 at 8:52 AM, Emilio Lahr-Vivaz <[email protected]>
>>> wrote:
>>>
>>>> After looking at it further, I think only the buffers themselves need
>>>> to be aligned, not the metadata and/or schema. Would there be any
>>>> problem with changing the alignment to 64 bytes then?
>>>>
>>>> Thanks,
>>>>
>>>> Emilio
>>>>
>>>>
>>>> On 08/08/2017 08:08 AM, Emilio Lahr-Vivaz wrote:
>>>>
>>>>> I'm looking into buffer alignment in the java writer classes.
>>>>> Currently some files written with the java streaming writer can't be
>>>>> read due to the javascript TypedArray's restriction that the start
>>>>> offset of the array must be a multiple of the data size of the array
>>>>> type (i.e. Int32Vectors must start on a multiple of 4, Float64Vectors
>>>>> must start on a multiple of 8, etc). From a cursory look at the java
>>>>> writer, I believe that the schema that is written first is not
>>>>> aligned at all, and then each record batch pads out its size to a
>>>>> multiple of 8. So:
>>>>>
>>>>> 1. should the schema block pad itself so that the first record batch
>>>>> is aligned, and is there any problem with doing so?
>>>>> 2. is there any problem with changing the alignment to 64 bytes, as
>>>>> recommended (but not required) by the spec?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Emilio
>>>>>
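
For reference, the end-padding idea discussed in this thread can be illustrated with a small Python sketch. This is not the Arrow Java writer or any Arrow API: the names `padded_length`, `pad_message`, and `concat_with_end_padding` are hypothetical, and messages are modelled as opaque byte strings with no length prefix or metadata. It only shows why padding each independently serialized message to a multiple of 8 bytes keeps every later message starting on an 8-byte boundary, without the serializer needing to know the absolute stream offset.

```python
ALIGNMENT = 8  # bytes; the thread also discusses 64 as an alternative


def padded_length(length: int, alignment: int = ALIGNMENT) -> int:
    """Round length up to the next multiple of alignment."""
    return (length + alignment - 1) // alignment * alignment


def pad_message(message: bytes, alignment: int = ALIGNMENT) -> bytes:
    """Append zero bytes so the message length is a multiple of alignment."""
    return message + b"\x00" * (padded_length(len(message), alignment) - len(message))


def concat_with_end_padding(messages, alignment: int = ALIGNMENT):
    """Concatenate end-padded messages; return the stream and each start offset."""
    stream = bytearray()
    offsets = []
    for message in messages:
        offsets.append(len(stream))  # where this message begins in the stream
        stream += pad_message(message, alignment)
    return bytes(stream), offsets


# Hypothetical message sizes of 13, 8 and 21 bytes, serialized independently.
stream, offsets = concat_with_end_padding([b"a" * 13, b"b" * 8, b"c" * 21])
print(offsets)      # [0, 16, 24]
print(len(stream))  # 48
```

Without the end padding, the same three messages would start at offsets 0, 13 and 21, so buffers inside the second and third messages could no longer be assumed to sit on 8-byte boundaries.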
