There must be an explanation for why 83 MB of compressed data grows to almost double its size. It doesn't make sense at all.

On Sat, Oct 26, 2024 at 7:03 PM Andi Vajda <va...@apache.org> wrote:

> On Oct 26, 2024, at 14:50, Prashant Saxena <animator...@gmail.com> wrote:
> >
> > I just need to store compressed strings to save space. If it can be done in
> > any other way, I'm OK with that.
>
> The JArray('byte') is the way.
>
> Andi..
>
> >> On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:
> >>
> >>> On Sat, 26 Oct 2024, Prashant Saxena wrote:
> >>>
> >>> PyLucene 10.0.0
> >>>
> >>> I'm trying to store a long text by compressing it first using zlib
> >>>
> >>> *doc.add(StoredField("contents", zlib.compress(ftext.encode('utf-8'))))*
> >>>
> >>> The resulting index size is *~83 MB*. When reading its value back using
> >>>
> >>> *c = doc.getBinaryValue("contents")*
> >>>
> >>> It's returning 'NoneType' and when using
> >>>
> >>> *c = doc.get("contents")*
> >>>
> >>> It's returning a string which cannot be decompressed.
> >>>
> >>> When using
> >>>
> >>> *doc.add(StoredField("contents",
> >>> JArray('byte')(zlib.compress(ftext.encode('utf-8')))))*
> >>>
> >>> The resulting index size is *~160 MB*. There is no problem in getting its
> >>> value using
> >>>
> >>> *c = doc.getBinaryValue("contents")*
> >>> *cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')*
> >>>
> >>> *Question 1:* Why does the index size almost double when using JArray?
> >>
> >> Because the value you're passing is actually processed correctly?
> >>
> >>> *Question 2:* How do you correctly create and store compressed binary data
> >>> in StoredField?
> >>
> >> If you want a python byte object, like b'abcd', to be seen by Lucene (Java)
> >> as a byte array, you should wrap it with a JArray('byte') like you did.
> >> Otherwise, it's seen as a string (I need to double-check) and not handled
> >> correctly.
> >>
> >>> I am using PyLucene in my current project. Please advise me if I should
> >>> post my questions on the java-user list instead of here.
> >>
> >> This particular question is specific to PyLucene and should be asked here,
> >> like you did ;-)
> >>
> >> Andi..
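
The symptom reported in the thread (a string comes back, but it cannot be decompressed) can be reproduced in plain Python without Lucene. The sketch below is only an illustration under a stated assumption: it simulates a lossy bytes-to-text conversion with `decode(..., errors="replace")`. That is not necessarily what PyLucene or JCC does internally when the `JArray('byte')` wrapper is omitted; the sample text and variable names are made up for the example.

```python
import zlib

# Hypothetical sample text standing in for the long field value.
text = "PyLucene stores this long text. " * 1000
raw = text.encode("utf-8")
compressed = zlib.compress(raw)

# When the exact bytes are preserved, the round trip works.
restored = zlib.decompress(compressed).decode("utf-8")

# Assumed lossy path: treat the compressed bytes as UTF-8 text and
# convert back. Invalid byte sequences are replaced, so the original
# bytes cannot be recovered from the resulting string.
as_text = compressed.decode("utf-8", errors="replace")
mangled = as_text.encode("utf-8")

try:
    zlib.decompress(mangled)
    lossy_ok = True
except zlib.error:
    lossy_ok = False

print("raw bytes:       ", len(raw))
print("compressed bytes:", len(compressed))
print("mangled bytes:   ", len(mangled))
print("lossy round trip decompresses:", lossy_ok)
```

Comparing `len(compressed)` summed over all documents with the on-disk index size may help judge which of the two index sizes is closer to the payload that was actually meant to be stored.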