On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <aldcr...@head.cfa.harvard.edu> wrote:
>> BTW -- maybe we should keep the pathological use-case in mind: really
>> short strings. I think we are all thinking in terms of longer strings,
>> maybe a name field, where you might assign 32 bytes or so -- then someone
>> has an accented character in their name, and then get 30 or 31 characters --
>> no big deal.
>
> I wouldn't call it a pathological use case, it doesn't seem so uncommon to
> have large datasets of short strings.

It's pathological for using a variable-length encoding.

> I personally deal with a database of hundreds of billions of 2 to 5
> character ASCII strings. This has been a significant blocker to Python 3
> adoption in my world.

I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...

> BTW, for those new to the list or with a short memory, this topic has been
> discussed fairly extensively at least 3 times before. Hopefully the
> *fourth* time will be the charm!

Yes, let's hope so!

The big difference now is that Julian seems to be committed to actually
making it happen!

Thanks Julian!

Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so. I have strong opinions, but
would still rather see any of the ideas on the table implemented than
nothing.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
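For anyone who wants the numbers behind the points above, here is a minimal sketch in plain Python (not NumPy, and not any proposed dtype). It compares the byte cost of a short ASCII string under a one-byte-per-char encoding, UTF-8, and a four-byte-per-char encoding, and then counts how many characters of a name fit in a fixed 32-byte UTF-8 field. The sample strings, the helper names `encoded_size` and `max_chars_in_field`, and the choice of `latin-1` and `utf-32-le` as stand-ins for "one byte per char" and "UCS-4" are all assumptions made for illustration.

```python
# Illustrative only: the sample strings and the 32-byte field width are
# assumptions chosen to match the numbers mentioned in the thread.

def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


def max_chars_in_field(text, field_bytes, encoding="utf-8"):
    """Count how many leading characters of `text` fit in `field_bytes` bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > field_bytes:
            break  # the next character would overflow the fixed-width field
        used += size
        count += 1
    return count


# A short ASCII string: 1 byte/char, UTF-8, and 4 bytes/char (no BOM).
short = "ABCDE"
sizes = {enc: encoded_size(short, enc) for enc in ("latin-1", "utf-8", "utf-32-le")}
print(sizes)  # {'latin-1': 5, 'utf-8': 5, 'utf-32-le': 20}

# A 32-byte UTF-8 field: one accented character costs one character of capacity.
ascii_name = "a" * 40
accented_name = "é" + "a" * 39  # "é" takes 2 bytes in UTF-8
print(max_chars_in_field(ascii_name, 32))     # 32
print(max_chars_in_field(accented_name, 32))  # 31
```

The fixed-width encodings keep the characters-per-field count predictable at the price of a constant factor in space; with UTF-8 the capacity in characters depends on the content, which is the concern with very short fields.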
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion