Hello all, Thanks for the input!
@Will > Could we name which ones it would be compatible with? UmbraDB [1], Velox [2], and DuckDB [3] all use unsigned integers for size. > As I understood it originally, the current implementations use raw pointers Yes, these implementations use raw pointers exclusively and for transmission via IPC those will require conversion. The compatibility we'd achieve by using unsigned integers is mostly one of documentation and validation; the above engines would accept a string as long as `2**32 - 1`, but with signed integers arrow would reject it. This is a distant edge case so I'd say it's acceptable to exclude it. (For completeness it should be noted that conversion between signed and unsigned is typically a trivial operation and shouldn't be considered as adding to any runtime cost.) @Dewey > ... the spec specifically discourages their use. > If the times have changed and this is no longer the case perhaps there > should be a wider effort to support unsigned values in other places I'm not sure which signed integers you're thinking of. At the risk of constructing a strawman, I would say that for example we shouldn't extend the offsets in the List type to support unsigned integers. This would require adding a new type named UnsignedOffsetsList or so, and I don't think that would offer a useful difference from the existing List type. I would recommend waiting for implementers to motivate such a change by proposing new cases where unsigned integers would be useful to support (as with unsigned dictionary indices). > I would lean towards signed integers because we don't use unsigned > integers anywhere in the existing specification It's true this is the convention in the spec. Since you and @pitrou have both mentioned it (and we haven't heard strong opinions in support of unsigned sizes), I think that's sufficient grounds to avoid the exception. Sincerely, Ben Kietzman [1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf (section 3.1, second paragraph) [2] https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L255 [3] https://github.com/duckdb/duckdb/blob/5a29c99891dcc71d19ce222ee3a68bf08686680e/src/include/duckdb/common/types/string_type.hpp#L210 On Tue, Sep 19, 2023 at 4:50 PM Will Jones <will.jones...@gmail.com> wrote: > Hi Ben, > > I'm open to the idea of using unsigned if it allows compatibility with an > existing implementation or two. Could we name which ones it would be > compatible with? Links to implementation code would be very welcome, if > available. > > As I understood it originally, the current implementations use raw pointers > rather than offsets into buffers, so these values already weren't > compatible to begin with. For example, it seems like Velox uses raw > pointers into buffers [1]. So unless I'm missing something, it seems like > these implementations will have to map the pointers to offsets anyways, so > maybe it's not much more trouble to convert to signed integers on the way. > > Best, > > Will Jones > > [1] > > https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78 > > On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman <bengil...@gmail.com> > wrote: > > > Hello again all, > > > > It seems there hasn't been much interest in this point so I'm leaning > > toward keeping unsigned integers. If anyone has a concern please respond > > here and/or on the PR [1]. > > > > Sincerely, > > Ben Kietzman > > > > [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022 > > > > On Thu, Sep 14, 2023 at 9:31 AM David Li <lidav...@apache.org> wrote: > > > > > I think Java was usually raised as the odd child out when this has come > > up > > > before. Since Java 8 there are standard library methods to manipulate > > > signed integers as if they were unsigned, so in principle Java > shouldn't > > be > > > a blocker anymore. > > > > > > That said, ByteBuffer is still indexed by int so in practice Java > > wouldn't > > > be able to handle more than 2 GB in a single buffer, at least until we > > can > > > use the Java 21+ APIs (MemorySegment is finally indexed by (signed) > > long). > > > > > > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote: > > > > Hello all, > > > > > > > > Utf8View was recently accepted [1] and I've opened a PR to add the > > > > spec/schema changes [2]. In review [3], it was requested that signed > 32 > > > bit > > > > integers be used for the fields of view structs instead of 32 bit > > > unsigned. > > > > > > > > This divergence has been discussed on the ML previously [4], but in > > light > > > > of my reviewer's request for a change it should be raised again for > > > focused > > > > discussion. (At this stage, I don't *think* the change would require > > > > another vote.) I'll enumerate the motivations for signed and unsigned > > as > > > I > > > > understand them. > > > > > > > > Signed: > > > > - signed integers are conventional in the arrow format > > > > - unsigned integers may cause some difficulty of implementation in > > > > languages which don't natively support them > > > > > > > > Unsigned: > > > > - unsigned integers are used by engines which already implement > > Utf8View > > > > > > > > My own bias is toward compatibility with existing implementers, but > > using > > > > signed integers will only affect the case of arrays which include > data > > > > buffers larger than 2GB. For reference, the default buffer size in > > velox > > > is > > > > 32KB so such a massive data buffer would only occur when a single > slot > > > of a > > > > string array has 2.1GB of characters. This seems sufficiently > unlikely > > > that > > > > I wouldn't consider it a blocker. > > > > > > > > Sincerely, > > > > Ben Kietzman > > > > > > > > [1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6 > > > > [2] https://github.com/apache/arrow/pull/37526 > > > > [3] > https://github.com/apache/arrow/pull/37526#discussion_r1323029022 > > > > [4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt > > > > [5] > > > > > > > > > > https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50 > > > > > >