Looking at fixed-size-list memory layout [1] I think we better proceed with this proposal and rather optimize the parquet reader/writer, e.g.: [2].
Best, Rok [1] https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout [2] https://github.com/apache/arrow/issues/34510#issuecomment-1463548318 On Wed, Mar 15, 2023 at 8:31 AM Alenka Frim <ale...@voltrondata.com.invalid> wrote: > Thank you for the clarification Adam. > > All the observations and conclusions you have made are very valuable. > Certainly, the fact that Parquet reading is not as fast as it could be is a > consequence of choosing a FixedSizeList type as a storage type for > the fixed shape tensor extension. > > Despite that, we (Joris and I) do not think that it is enough reason to > change > the specification. Flat doubles, for example, are not an actual alternative > but > something a user could potentially do on the side of the application to get > a > faster write/read. > > Also we think there is no need to add restrictions on the presence of null > values and should be up to the application to add (in many cases tensor > data will not have nulls, since in general array libraries don't support > those). > > I do hope future use of this extension type will not be annoying as you > have previously experienced with timestamp and timezone conversions. > > As you stated this is not a blocking issue I will wait for later today to > finish this vote and proceed with the spec merge so we can get some > more attention to the C++ implementation. > > Best, > Alenka > > On Mon, Mar 13, 2023 at 2:28 PM Adam Lippai <a...@rigo.sk> wrote: > > > Hi Alenka, > > > > We didn’t discuss or benchmark the alternative formats. My understanding > is > > that the best should be similar to an primitive double Arrow column. > > Currently the parquet (de)serialization takes 3x longer than desired for > > the new Tensor type. That sounds more than “chasing the last 20% of > > performance”. > > > > The conversation can be continued separately, the most pressing questions > > or issues are: > > 1. We might want to specify that a tensor has to consist of fixed size, > > non-null and non-nested items to avoid confusion. This is a big > constraint, > > however makes it easier to have consistent assumptions and optimize eg > the > > parquet storage later. Alternatively we can define a DoubleTensor later > (or > > just accept the behavior varies a lot depending on the stored data, even > > int and double tensor matmul is ridiculously confusing anyways) > > 2. Adding a fixed byte array based storage for fixedsizedlist with > > primitives for Arrow<->Parquet conversion is desired to improve the > > performance. It was still slower than storing doubles, but much better > than > > storing the list. We lose the parquet features eg delta encoding, list > item > > statistics or bloom filter (this might be already missing for lists, I > > didn’t check yet) > > 3. The pandas numpy array is good news. I will confirm if the memory in > the > > column is continuous and operations can be vectorized or is it more > similar > > to an object storage with individual pointers > > > > I don’t think the above are blocking issues. I’ve raised this here only > > because I remember how annoying the timestamp and timezone conversions > were > > (not round tripping with pandas, parquet storage change). > > > > P.S. I have almost zero experience with DNNs, but some reference how our > > layout compares to NCHW or what batch sizes are can be interesting in the > > docs: > > > > > https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html > > I guess it’s all doable with the proposed extension. > > > > Best regards, > > Adam Lippai > > > > > > On Mon, Mar 13, 2023 at 4:15 AM Alenka Frim <ale...@voltrondata.com > > .invalid> > > wrote: > > > > > Hi Adam, > > > > > > you are referring to the issue you raised on the Arrow repo [1] that > > turned > > > into a good discussion about FixedSizeList and the current conversion > > > to Parquet. > > > > > > Please correct me if I am wrong, but the outcome of the discussion was > > that > > > the > > > conversion is still pretty fast (much faster than commonly used > > > serialization formats for > > > tensors) though not as fast compared to other primitives in Apache > Arrow. > > > > > > My opinion is that the discussion on this topic can be opened up > > > separately in > > > connection to optimising conversion between FixedSizeList as an Arrow > > > format > > > to Parquet, if there is still a need to do so. > > > > > > For this canonical extension type I would say it is an implementation > > > detail > > > and you mention a way to handle that with Parquet in the issue > mentioned > > > [2]. > > > > > > I do not think there should be any issues in the conversion to Pandas. > > > The conversion to numpy is not expensive and I would think the > conversion > > > to pandas should be the same. See PyArrow illustrative implementation > > [3]. > > > > > > [1]: https://github.com/apache/arrow/issues/34510 > > > [2]: > > https://github.com/apache/arrow/issues/34510#issuecomment-1464463384 > > > [3]: > > > > > > > > > https://github.com/apache/arrow/pull/33948/files#diff-efc1a41cdf04b6ec96d822dbec1f1993e0bbd17050b1b5f1275c8e3443a38828 > > > > > > All well, > > > Alenka > > > > > > On Fri, Mar 10, 2023 at 11:32 PM Adam Lippai <a...@rigo.sk> wrote: > > > > > > > Since the specification explicitly mentions FixedSizeList, but the > > > current > > > > conversion to/from parquet is expensive compared to doubles and other > > > > primitives (the nested type needs repetition and definition levels) > > > should > > > > we discuss what’s the recommendation when integrating with other > > > non-arrow > > > > systems or is that an implementation detail only? (Pandas, parquet) > > > > > > > > Best regards, > > > > Adam Lippai > > > > > > > > On Wed, Mar 8, 2023 at 1:13 AM Alenka Frim <ale...@voltrondata.com > > > > .invalid> > > > > wrote: > > > > > > > > > > > > > > > > Just one comment, though: since we also define a separate > "Tensor" > > > IPC > > > > > > structure in Arrow, maybe we should state the relationship > > somewhere > > > in > > > > > the > > > > > > documentation? (Even if the answer is "no relationship".) > > > > > > > > > > > > > > > > Agree David, thanks for bringing it up. > > > > > > > > > > I will add the information about "no relationship" to the Tensor > IPC > > > > > structure into the spec and will also keep in mind to add it to the > > > > > documentation that follows the implementations. > > > > > > > > > > > > > > >