Looking at fixed-size-list memory layout [1] I think we better proceed with
this proposal and rather optimize the parquet reader/writer, e.g.: [2].

Best,
Rok

[1]
https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout
[2] https://github.com/apache/arrow/issues/34510#issuecomment-1463548318

On Wed, Mar 15, 2023 at 8:31 AM Alenka Frim <ale...@voltrondata.com.invalid>
wrote:

> Thank you for the clarification Adam.
>
> All the observations and conclusions you have made are very valuable.
> Certainly, the fact that Parquet reading is not as fast as it could be is a
> consequence of choosing a FixedSizeList type as a storage type for
> the fixed shape tensor extension.
>
> Despite that, we (Joris and I) do not think that it is enough reason to
> change
> the specification. Flat doubles, for example, are not an actual alternative
> but
> something a user could potentially do on the side of the application to get
> a
> faster write/read.
>
> Also we think there is no need to add restrictions on the presence of null
> values and should be up to the application to add (in many cases tensor
> data will not have nulls, since in general array libraries don't support
> those).
>
> I do hope future use of this extension type will not be annoying as you
> have previously experienced with timestamp and timezone conversions.
>
> As you stated this is not a blocking issue I will wait for later today to
> finish this vote and proceed with the spec merge so we can get some
> more attention to the C++ implementation.
>
> Best,
> Alenka
>
> On Mon, Mar 13, 2023 at 2:28 PM Adam Lippai <a...@rigo.sk> wrote:
>
> > Hi Alenka,
> >
> > We didn’t discuss or benchmark the alternative formats. My understanding
> is
> > that the best should be similar to an primitive double Arrow column.
> > Currently the parquet (de)serialization takes 3x longer than desired for
> > the new Tensor type. That sounds more than “chasing the last 20% of
> > performance”.
> >
> > The conversation can be continued separately, the most pressing questions
> > or issues are:
> > 1. We might want to specify that a tensor has to consist of fixed size,
> > non-null and non-nested items to avoid confusion. This is a big
> constraint,
> > however makes it easier to have consistent assumptions and optimize eg
> the
> > parquet storage later. Alternatively we can define a DoubleTensor later
> (or
> > just accept the behavior varies a lot depending on the stored data, even
> > int and double tensor matmul is ridiculously confusing anyways)
> > 2. Adding a fixed byte array based storage for fixedsizedlist with
> > primitives for Arrow<->Parquet conversion is desired to improve the
> > performance. It was still slower than storing doubles, but much better
> than
> > storing the list. We lose the parquet features eg delta encoding, list
> item
> > statistics or bloom filter (this might be already missing for lists, I
> > didn’t check yet)
> > 3. The pandas numpy array is good news. I will confirm if the memory in
> the
> > column is continuous and operations can be vectorized or is it more
> similar
> > to an object storage with individual pointers
> >
> > I don’t think the above are blocking issues. I’ve raised this here only
> > because I remember how annoying the timestamp and timezone conversions
> were
> > (not round tripping with pandas, parquet storage change).
> >
> > P.S. I have almost zero experience with DNNs, but some reference how our
> > layout compares to NCHW or what batch sizes are can be interesting in the
> > docs:
> >
> >
> https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html
> > I guess it’s all doable with the proposed extension.
> >
> > Best regards,
> > Adam Lippai
> >
> >
> > On Mon, Mar 13, 2023 at 4:15 AM Alenka Frim <ale...@voltrondata.com
> > .invalid>
> > wrote:
> >
> > > Hi Adam,
> > >
> > > you are referring to the issue you raised on the Arrow repo [1] that
> > turned
> > > into a good discussion about FixedSizeList and the current conversion
> > > to Parquet.
> > >
> > > Please correct me if I am wrong, but the outcome of the discussion was
> > that
> > > the
> > > conversion is still pretty fast (much faster than commonly used
> > > serialization formats for
> > > tensors) though not as fast compared to other primitives in Apache
> Arrow.
> > >
> > > My opinion is that the discussion on this topic can be opened up
> > > separately in
> > > connection to optimising conversion between FixedSizeList as an Arrow
> > > format
> > > to Parquet, if there is still a need to do so.
> > >
> > > For this canonical extension type I would say it is an implementation
> > > detail
> > > and you mention a way to handle that with Parquet in the issue
> mentioned
> > > [2].
> > >
> > > I do not think there should be any issues in the conversion to Pandas.
> > > The conversion to numpy is not expensive and I would think the
> conversion
> > > to pandas should be the same. See PyArrow illustrative implementation
> > [3].
> > >
> > > [1]: https://github.com/apache/arrow/issues/34510
> > > [2]:
> > https://github.com/apache/arrow/issues/34510#issuecomment-1464463384
> > > [3]:
> > >
> > >
> >
> https://github.com/apache/arrow/pull/33948/files#diff-efc1a41cdf04b6ec96d822dbec1f1993e0bbd17050b1b5f1275c8e3443a38828
> > >
> > > All well,
> > > Alenka
> > >
> > > On Fri, Mar 10, 2023 at 11:32 PM Adam Lippai <a...@rigo.sk> wrote:
> > >
> > > > Since the specification explicitly mentions FixedSizeList, but the
> > > current
> > > > conversion to/from parquet is expensive compared to doubles and other
> > > > primitives (the nested type needs repetition and definition levels)
> > > should
> > > > we discuss what’s the recommendation when integrating with other
> > > non-arrow
> > > > systems or is that an implementation detail only? (Pandas, parquet)
> > > >
> > > > Best regards,
> > > > Adam Lippai
> > > >
> > > > On Wed, Mar 8, 2023 at 1:13 AM Alenka Frim <ale...@voltrondata.com
> > > > .invalid>
> > > > wrote:
> > > >
> > > > > >
> > > > > > Just one comment, though: since we also define a separate
> "Tensor"
> > > IPC
> > > > > > structure in Arrow, maybe we should state the relationship
> > somewhere
> > > in
> > > > > the
> > > > > > documentation? (Even if the answer is "no relationship".)
> > > > > >
> > > > >
> > > > > Agree David, thanks for bringing it up.
> > > > >
> > > > > I will add the information about "no relationship" to the Tensor
> IPC
> > > > > structure into the spec and will also keep in mind to add it to the
> > > > > documentation that follows the implementations.
> > > > >
> > > >
> > >
> >
>

Reply via email to