Hi Alenka,

We didn't discuss or benchmark the alternative formats. My understanding is that the best case should perform similarly to a primitive double Arrow column. Currently the Parquet (de)serialization takes 3x longer than desired for the new Tensor type, which sounds like more than "chasing the last 20% of performance".
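For reference, this is roughly the comparison I have in mind. It is only a minimal sketch assuming pyarrow; the column name, sizes and /tmp paths are placeholders, and absolute timings will of course vary by version and hardware:

# Rough round-trip comparison: the same double values stored as a flat
# float64 column vs. as a FixedSizeList<double, 16> column.
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_rows, tensor_size = 100_000, 16
values = pa.array(np.random.rand(n_rows * tensor_size))

tables = {
    "flat_float64": pa.table({"x": values}),
    "fixed_size_list": pa.table(
        {"x": pa.FixedSizeListArray.from_arrays(values, tensor_size)}
    ),
}

for name, table in tables.items():
    path = f"/tmp/{name}.parquet"
    start = time.perf_counter()
    pq.write_table(table, path)
    _ = pq.read_table(path)
    print(name, round(time.perf_counter() - start, 4), "s")

The nested column has to carry repetition and definition levels through the writer and reader, which is where I expect most of the extra time to go.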
The conversation can be continued separately; the most pressing questions or issues are:

1. We might want to specify that a tensor has to consist of fixed-size, non-null and non-nested items to avoid confusion. This is a big constraint, but it makes it easier to keep the assumptions consistent and to optimize e.g. the Parquet storage later. Alternatively we can define a DoubleTensor later (or just accept that the behavior varies a lot depending on the stored data; even int and double tensor matmul is ridiculously confusing anyway).
2. Adding a fixed byte array based storage for FixedSizeList of primitives in the Arrow<->Parquet conversion is desirable to improve the performance. It was still slower than storing doubles, but much better than storing the list. We lose Parquet features such as delta encoding, list item statistics or bloom filters (these might already be missing for lists, I didn't check yet).
3. The pandas numpy array is good news. I will confirm whether the memory in the column is contiguous so operations can be vectorized, or whether it is closer to object storage with individual pointers (see the quick check sketched at the bottom of this mail).

I don't think the above are blocking issues. I've raised this here only because I remember how annoying the timestamp and timezone conversions were (not round-tripping with pandas, Parquet storage changes).

P.S. I have almost zero experience with DNNs, but some reference in the docs on how our layout compares to NCHW, or what batch sizes are, could be interesting:
https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html
I guess it's all doable with the proposed extension.

Best regards,
Adam Lippai

On Mon, Mar 13, 2023 at 4:15 AM Alenka Frim <ale...@voltrondata.com.invalid> wrote:

> Hi Adam,
>
> you are referring to the issue you raised on the Arrow repo [1] that
> turned into a good discussion about FixedSizeList and the current
> conversion to Parquet.
>
> Please correct me if I am wrong, but the outcome of the discussion was
> that the conversion is still pretty fast (much faster than commonly used
> serialization formats for tensors), though not as fast as other
> primitives in Apache Arrow.
>
> My opinion is that the discussion on this topic can be opened up
> separately, in connection to optimising the conversion between
> FixedSizeList as an Arrow format and Parquet, if there is still a need
> to do so.
>
> For this canonical extension type I would say it is an implementation
> detail, and you mention a way to handle that with Parquet in the issue
> mentioned [2].
>
> I do not think there should be any issues in the conversion to Pandas.
> The conversion to numpy is not expensive and I would think the conversion
> to pandas should be the same. See the PyArrow illustrative implementation [3].
>
> [1]: https://github.com/apache/arrow/issues/34510
> [2]: https://github.com/apache/arrow/issues/34510#issuecomment-1464463384
> [3]: https://github.com/apache/arrow/pull/33948/files#diff-efc1a41cdf04b6ec96d822dbec1f1993e0bbd17050b1b5f1275c8e3443a38828
>
> All well,
> Alenka
>
> On Fri, Mar 10, 2023 at 11:32 PM Adam Lippai <a...@rigo.sk> wrote:
>
> > Since the specification explicitly mentions FixedSizeList, but the
> > current conversion to/from Parquet is expensive compared to doubles and
> > other primitives (the nested type needs repetition and definition
> > levels), should we discuss what's the recommendation when integrating
> > with other non-Arrow systems, or is that an implementation detail only?
> > (Pandas, parquet)
> >
> > Best regards,
> > Adam Lippai
> >
> > On Wed, Mar 8, 2023 at 1:13 AM Alenka Frim <ale...@voltrondata.com.invalid> wrote:
> >
> > > > Just one comment, though: since we also define a separate "Tensor" IPC
> > > > structure in Arrow, maybe we should state the relationship somewhere
> > > > in the documentation? (Even if the answer is "no relationship".)
> > >
> > > Agree David, thanks for bringing it up.
> > >
> > > I will add the information about "no relationship" to the Tensor IPC
> > > structure into the spec and will also keep in mind to add it to the
> > > documentation that follows the implementations.
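The quick check mentioned in point 3 above would be something along these lines. It is only a minimal sketch assuming pyarrow and pandas; the exact conversion behavior may differ between versions, so the comments describe my expectation rather than a verified result:

# Does a FixedSizeList column map to one contiguous double buffer,
# or to per-row objects after conversion?
import numpy as np
import pyarrow as pa

arr = pa.FixedSizeListArray.from_arrays(pa.array(np.arange(12.0)), 3)

# Zero-copy view over the flat child values, reshaped to (rows, tensor size).
flat = arr.values.to_numpy(zero_copy_only=True).reshape(-1, 3)
print(flat.flags["C_CONTIGUOUS"])  # expected: True, a single contiguous buffer

# The default pandas conversion likely gives an object column holding one
# small ndarray per row, which is not directly vectorizable.
print(arr.to_pandas().dtype)  # expected: object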