Re: buffer alignment (format/java/js)

Emilio Lahr-Vivaz Tue, 08 Aug 2017 11:12:00 -0700

I opened a PR for comments here: https://github.com/apache/arrow/pull/954

So far I've just added some padding to the end of the buffers. Iprobably will need to pad the schema out as well... I'm testing thingsout against the js code at the moment.


On 08/08/2017 02:06 PM, Wes McKinney wrote:

I opened https://issues.apache.org/jira/browse/ARROW-1343. Let's try
to resolve this today so it can make the 0.6.0 release?

On Tue, Aug 8, 2017 at 2:03 PM, Wes McKinney <wesmck...@gmail.com> wrote:

I'm happy to have a look at the branch / integration tests if you
could put up a PR

When you say "a single serialized record batch" you mean an
encapsulated message, right (including length prefix and metadata)?
Using the terminology from http://arrow.apache.org/docs/ipc.html. I
guess the problem is that the total size of the Schema message at the
start of the stream may not be a multiple of 8. We should totally fix
this; I don't think it even constitutes a breakage of the format -- I
am fairly sure with extra padding bytes between the schema and the
first record batch (or first dictionary) that the stream will be
backwards compatible.

However, we should document in
https://github.com/apache/arrow/blob/master/format/IPC.md that message
sizes are expected to be a multiple of 8. We should also take a look
at the File format implementation to ensure that padding is inserted
after the magic number at the start of the file

- Wes

On Tue, Aug 8, 2017 at 1:32 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:

Sure, the workflow is a little complicated, but we have the following code
running in distributed databases (as accumulo iterators and hbase
coprocessors). They process data rows and transform them into arrow records,
then periodically write out record batches:

https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L79-L105

https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/io/records/RecordBatchUnloader.scala#L20

The record batches come back to a single client and are concatenated with a
file header and footer (then wrapped in SimpleFeature objects, as we
implement a geotools data store):

https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L265-L268

The resulting bytes are written out as an arrow streaming file that we parse
with the arrow-js libraries in the browser.

Thanks,

Emilio


On 08/08/2017 01:24 PM, Li Jin wrote:

Hi Emilio,

So I think the issue is that we are serializing record batches in a

distributed fashion, and then > concatenating them in the streaming
format.

Can you show the code for this?

On Tue, Aug 8, 2017 at 12:35 PM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
wrote:

So I think the issue is that we are serializing record batches in a
distributed fashion, and then concatenating them in the streaming format.
However, the message serialization only aligns the start of the buffers,
which requires it to know the current absolute offset of the output
stream.
Would there be any problem with padding the end of the message, so any
single serialized record batch would always be a multiple of 8 bytes?

I've put together a branch that does this, and the existing java tests
all
pass. I'm having some trouble running the integration tests though.

Thanks,

Emilio


On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote:

Hi Wes,

You're right, I just realized that. I think the alignment issue might be
in some unrelated code, actually. From what I can tell the the arrow
writers are aligning buffers correctly; if not I'll open a bug.

Thanks,

Emilio

On 08/08/2017 09:15 AM, Wes McKinney wrote:

hi Emilio,

   From your description, it isn't clear why 8-byte alignment is causing
a problem (as compare with 64-byte alignment). My understanding is
that JavaScript's TypedArray classes range in size from 1 to 8 bytes.

The starting offset for all buffers should be 8-byte aligned, if not
that is a bug. Could you clarify?

- Wes

On Tue, Aug 8, 2017 at 8:52 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
wrote:

After looking at it further, I think only the buffers themselves need
to be
aligned, not the metadata and/or schema. Would there be any problem
with
changing the alignment to 64 bytes then?

Thanks,

Emilio


On 08/08/2017 08:08 AM, Emilio Lahr-Vivaz wrote:

I'm looking into buffer alignment in the java writer classes.
Currently
some files written with the java streaming writer can't be read due
to
the
javascript TypedArray's restriction that the start offset of the
array
must
be a multiple of the data size of the array type (i.e. Int32Vectors
must
start on a multiple of 4, Float64Vectors must start on a multiple of
8,
etc). From a cursory look at the java writer, I believe that the
schema that
is written first is not aligned at all, and then each record batch
pads out
its size to a multiple of 8. So:

1. should the schema block pad itself so that the first record batch
is
aligned, and is there any problem with doing so?
2. is there any problem with changing the alignment to 64 bytes, as
recommended (but not required) by the spec?

Thanks,

Emilio

Re: buffer alignment (format/java/js)

Reply via email to