> On Oct 26, 2024, at 16:21, Prashant Saxena <animator...@gmail.com> wrote:
>
> There must be an explanation about 83 MB of compressed data getting almost
> double of its size. It doesn't make sense at all.
When not using a JArray('byte'), your Python byte array is converted into a partial Java string and is being corrupted, probably at the first UTF-8 conversion error. I didn't actually verify this, as I'm not near my computer, but you're comparing a working solution with a non-working one 😊

Andi..

>
>> On Sat, Oct 26, 2024 at 7:03 PM Andi Vajda <va...@apache.org> wrote:
>>
>>
>>> On Oct 26, 2024, at 14:50, Prashant Saxena <animator...@gmail.com> wrote:
>>>
>>> I just need to store compressed strings to save space. If it can be done in
>>> any other way, I'm OK with that.
>>
>> The JArray('byte') is the way.
>>
>> Andi..
>>
>>>
>>>
>>>> On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:
>>>>
>>>>
>>>>> On Sat, 26 Oct 2024, Prashant Saxena wrote:
>>>>>
>>>>> PyLucene 10.0.0
>>>>>
>>>>> I'm trying to store a long text by compressing it first using zlib
>>>>>
>>>>> doc.add(StoredField("contents", zlib.compress(ftext.encode('utf-8'))))
>>>>>
>>>>> The resulting index size is ~83 MB. When reading its value back using
>>>>>
>>>>> c = doc.getBinaryValue("contents")
>>>>>
>>>>> It's returning 'NoneType' and when using
>>>>>
>>>>> c = doc.get("contents")
>>>>>
>>>>> It's returning a string which cannot be decompressed.
>>>>>
>>>>> When using
>>>>>
>>>>> doc.add(StoredField("contents",
>>>>>     JArray('byte')(zlib.compress(ftext.encode('utf-8')))))
>>>>>
>>>>> The resulting index size is ~160 MB. There is no problem in getting its
>>>>> value using
>>>>>
>>>>> c = doc.getBinaryValue("contents")
>>>>> cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')
>>>>>
>>>>> Question 1: Why does the index size almost double when using JArray?
>>>>
>>>> Because the value you're passing is actually processed correctly?
>>>>
>>>>> Question 2: How do you correctly create and store compressed binary data
>>>>> in StoredField?
>>>>
>>>> If you want a python byte object, like b'abcd', to be seen by Lucene (Java)
>>>> as a byte array, you should wrap it with a JArray('byte') like you did.
>>>> Otherwise, it's seen as a string (I need to double-check) and not handled
>>>> correctly.
>>>>
>>>>> I am using PyLucene in my current project. Please advise me if I should
>>>>> post my questions on the java-user list instead of here.
>>>>
>>>> This particular question is specific to PyLucene and should be asked here,
>>>> like you did ;-)
>>>>
>>>> Andi..
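
For anyone reading this thread later, here is a small stand-alone sketch of why zlib output cannot survive being treated as text. It is plain Python with no PyLucene or JVM involved, so it does not reproduce what JCC actually does with an unwrapped bytes object (that part was left unverified above). It only shows that compressed bytes are not valid UTF-8 and that a bytes-to-string-to-bytes trip damages them. The sample `ftext` value is made up for illustration.

```python
import zlib

# Hypothetical sample text standing in for `ftext` from the thread.
ftext = "PyLucene stores long text fields. " * 1000
compressed = zlib.compress(ftext.encode("utf-8"))

# 1. Kept as raw bytes, the payload round-trips losslessly.
restored = zlib.decompress(compressed).decode("utf-8")

# 2. Interpreted as UTF-8 text, the payload fails to decode.
try:
    compressed.decode("utf-8")
    first_bad_offset = None
except UnicodeDecodeError as err:
    first_bad_offset = err.start  # offset of the first undecodable byte

# 3. A lossy conversion to text and back does not give the same bytes,
#    so the result can no longer be decompressed.
lossy = compressed.decode("utf-8", errors="replace").encode("utf-8")
try:
    zlib.decompress(lossy)
    lossy_decompresses = True
except zlib.error:
    lossy_decompresses = False

print("round trip as bytes ok:", restored == ftext)
print("first invalid UTF-8 offset:", first_bad_offset)
print("lossy copy identical:", lossy == compressed)
print("lossy copy decompresses:", lossy_decompresses)
```

In PyLucene itself, the working combination reported in this thread is to wrap the compressed bytes with JArray('byte') before passing them to StoredField, and to read them back with doc.getBinaryValue("contents") followed by zlib.decompress(c.bytes.bytes_).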