Hey Wes, I appreciate your comments and want to be clear that I am not blocking this addition.
Memory mapping itself is not in conflict with my comments. However, since Arrow datasets do not frequently exist on disk today, a user can choose whether to use smaller or larger batches. When that choice is present, I have a hard time seeing situations where there are real disadvantages to having 1000 batches of a billion records each versus one batch of a trillion records.

I understand the argument for, and the need for, LargeBinary data in general. However, in those situations I'm not clear what benefit a columnar representation provides when the data is laid out end to end. At that point, you're probably much better off just storing the items individually and using some form of indirection/addressing from an Arrow structure to independent large objects (there's a rough sketch of what I mean at the bottom of this note).

This all comes down to how much Arrow needs to be all things to all people. I don't argue that there are use cases for this stuff; I just wonder how much any of the structural elements of Arrow benefit such use cases (beyond a nice set of libraries to work with).

> Personally I would rather address the 64-bit offset issue now so that I
> stop hearing the objection from would-be users (I can count a dozen or so
> occasions where I've been accosted in person over this issue at conferences
> and elsewhere). It would be a good idea to recommend a preference for
> 32-bit offsets in our documentation.

I wish we could understand how much of this is people trying to force-fit Arrow into preconceived notions versus true needs that are complementary to the ones the Arrow community benefits from today. (I know this will just continue to be a wish.) I know that, as an engineer, I am great at pointing out potential constraints in technologies I haven't yet used or fully understood. I wonder how many others have the same failing when looking at Arrow :D

J
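
P.S. For concreteness, here is a minimal sketch of the indirection I'm describing, using pyarrow. Everything in it is illustrative (the column names, URIs, and sizes are made up), but it shows the shape of the idea: keep the Arrow structure small and let it point at independently stored large objects, rather than packing the payloads end to end into a LargeBinary column.

    import pyarrow as pa

    # Hypothetical metadata for three large objects stored outside the Arrow buffers.
    refs = pa.table({
        "object_id": pa.array([1, 2, 3], type=pa.int64()),
        "uri": pa.array(
            ["s3://some-bucket/blob-1",
             "s3://some-bucket/blob-2",
             "s3://some-bucket/blob-3"],
            type=pa.string(),
        ),
        "size_bytes": pa.array(
            [7_500_000_000, 12_000_000_000, 950_000_000], type=pa.int64()
        ),
    })

    # The columnar part stays small and cheap to scan or ship around; each
    # multi-gigabyte payload is fetched individually by whatever storage layer
    # holds it, and only when it is actually needed.
    print(refs)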
