Thanks for the feedback, everyone! I'm calling this in favor of signed integers. That's the last change requested on the format PR, so I'll probably merge shortly.
On Wed, Sep 20, 2023, 17:43 Matt Topol <m...@voltrondata.com.invalid> wrote: > Just to chime in (and add yet another voice into the mix here), I'd have a > preference for it being signed integers for the same reasons as most > everyone else: consistency with everything else in the spec. Since we use > signed integers everywhere, I'd prefer to keep it consistent rather than > introduce an exception. > > --Matt > > On Wed, Sep 20, 2023 at 11:19 AM Benjamin Kietzman <bengil...@gmail.com> > wrote: > > > Hello all, > > > > Thanks for the input! > > > > @Will > > > > > Could we name which ones it would be compatible with? > > > > UmbraDB [1], Velox [2], and DuckDB [3] all use unsigned integers for > size. > > > > > As I understood it originally, the current implementations use raw > > pointers > > > > Yes, these implementations use raw pointers exclusively and for > > transmission via IPC > > those will require conversion. The compatibility we'd achieve by using > > unsigned > > integers is mostly one of documentation and validation; the above engines > > would > > accept a string as long as `2**32 - 1`, but with signed integers arrow > > would reject > > it. This is a distant edge case so I'd say it's acceptable to exclude it. > > (For completeness it should be noted that conversion between signed and > > unsigned is > > typically a trivial operation and shouldn't be considered as adding to > any > > runtime cost.) > > > > @Dewey > > > ... the spec specifically discourages their use. > > > If the times have changed and this is no longer the case perhaps there > > > should be a wider effort to support unsigned values in other places > > > > I'm not sure which signed integers you're thinking of. At the risk of > > constructing a > > strawman, I would say that for example we shouldn't extend the offsets in > > the List type > > to support unsigned integers. This would require adding a new type named > > UnsignedOffsetsList or so, and I don't think that would offer a useful > > difference from > > the existing List type. I would recommend waiting for implementers to > > motivate such a > > change by proposing new cases where unsigned integers would be useful to > > support > > (as with unsigned dictionary indices). > > > > > I would lean towards signed integers because we don't use unsigned > > > integers anywhere in the existing specification > > > > It's true this is the convention in the spec. Since you and @pitrou have > > both mentioned > > it (and we haven't heard strong opinions in support of unsigned sizes), I > > think that's > > sufficient grounds to avoid the exception. > > > > Sincerely, > > Ben Kietzman > > > > [1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf (section > 3.1, > > second paragraph) > > [2] > > > > > https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L255 > > [3] > > > > > https://github.com/duckdb/duckdb/blob/5a29c99891dcc71d19ce222ee3a68bf08686680e/src/include/duckdb/common/types/string_type.hpp#L210 > > > > On Tue, Sep 19, 2023 at 4:50 PM Will Jones <will.jones...@gmail.com> > > wrote: > > > > > Hi Ben, > > > > > > I'm open to the idea of using unsigned if it allows compatibility with > an > > > existing implementation or two. Could we name which ones it would be > > > compatible with? Links to implementation code would be very welcome, if > > > available. > > > > > > As I understood it originally, the current implementations use raw > > pointers > > > rather than offsets into buffers, so these values already weren't > > > compatible to begin with. For example, it seems like Velox uses raw > > > pointers into buffers [1]. So unless I'm missing something, it seems > like > > > these implementations will have to map the pointers to offsets anyways, > > so > > > maybe it's not much more trouble to convert to signed integers on the > > way. > > > > > > Best, > > > > > > Will Jones > > > > > > [1] > > > > > > > > > https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78 > > > > > > On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman <bengil...@gmail.com > > > > > wrote: > > > > > > > Hello again all, > > > > > > > > It seems there hasn't been much interest in this point so I'm leaning > > > > toward keeping unsigned integers. If anyone has a concern please > > respond > > > > here and/or on the PR [1]. > > > > > > > > Sincerely, > > > > Ben Kietzman > > > > > > > > [1] > https://github.com/apache/arrow/pull/37526#discussion_r1323029022 > > > > > > > > On Thu, Sep 14, 2023 at 9:31 AM David Li <lidav...@apache.org> > wrote: > > > > > > > > > I think Java was usually raised as the odd child out when this has > > come > > > > up > > > > > before. Since Java 8 there are standard library methods to > manipulate > > > > > signed integers as if they were unsigned, so in principle Java > > > shouldn't > > > > be > > > > > a blocker anymore. > > > > > > > > > > That said, ByteBuffer is still indexed by int so in practice Java > > > > wouldn't > > > > > be able to handle more than 2 GB in a single buffer, at least until > > we > > > > can > > > > > use the Java 21+ APIs (MemorySegment is finally indexed by (signed) > > > > long). > > > > > > > > > > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote: > > > > > > Hello all, > > > > > > > > > > > > Utf8View was recently accepted [1] and I've opened a PR to add > the > > > > > > spec/schema changes [2]. In review [3], it was requested that > > signed > > > 32 > > > > > bit > > > > > > integers be used for the fields of view structs instead of 32 bit > > > > > unsigned. > > > > > > > > > > > > This divergence has been discussed on the ML previously [4], but > in > > > > light > > > > > > of my reviewer's request for a change it should be raised again > for > > > > > focused > > > > > > discussion. (At this stage, I don't *think* the change would > > require > > > > > > another vote.) I'll enumerate the motivations for signed and > > unsigned > > > > as > > > > > I > > > > > > understand them. > > > > > > > > > > > > Signed: > > > > > > - signed integers are conventional in the arrow format > > > > > > - unsigned integers may cause some difficulty of implementation > in > > > > > > languages which don't natively support them > > > > > > > > > > > > Unsigned: > > > > > > - unsigned integers are used by engines which already implement > > > > Utf8View > > > > > > > > > > > > My own bias is toward compatibility with existing implementers, > but > > > > using > > > > > > signed integers will only affect the case of arrays which include > > > data > > > > > > buffers larger than 2GB. For reference, the default buffer size > in > > > > velox > > > > > is > > > > > > 32KB so such a massive data buffer would only occur when a single > > > slot > > > > > of a > > > > > > string array has 2.1GB of characters. This seems sufficiently > > > unlikely > > > > > that > > > > > > I wouldn't consider it a blocker. > > > > > > > > > > > > Sincerely, > > > > > > Ben Kietzman > > > > > > > > > > > > [1] > > https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6 > > > > > > [2] https://github.com/apache/arrow/pull/37526 > > > > > > [3] > > > https://github.com/apache/arrow/pull/37526#discussion_r1323029022 > > > > > > [4] > > https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt > > > > > > [5] > > > > > > > > > > > > > > > > > > > > > https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50 > > > > > > > > > > > > > > >