> On Oct 26, 2024, at 16:21, Prashant Saxena <animator...@gmail.com> wrote:
> 
> There must be an explanation about 83 MB of compressed data getting almost
> double of its size. It doesn't make sense at all.

Without the JArray('byte') wrapper, your Python bytes object is converted into
a partial Java string and gets corrupted, probably at the first UTF-8
conversion error. I haven't actually verified this, as I'm not near my
computer, but you're comparing a working solution with a non-working one 😊
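
A minimal sketch of why compressed bytes cannot survive a detour through a
string. It uses only Python's zlib, with a lossy UTF-8 decode as a stand-in for
the conversion. What JCC actually does with an unwrapped bytes object is not
verified here, and the sample text is made up.

```python
import zlib

# Made-up sample text standing in for the real document contents.
text = "PyLucene stores compressed fields. " * 1000
compressed = zlib.compress(text.encode("utf-8"))

# zlib output is binary data, not UTF-8 text, so a strict decode fails.
try:
    compressed.decode("utf-8")
    first_error = None
except UnicodeDecodeError as exc:
    first_error = exc.start  # offset of the first undecodable byte

# Stand-in for a lossy bytes -> string -> bytes conversion (an assumption,
# not necessarily what happens inside JCC).
mangled = compressed.decode("utf-8", errors="replace").encode("utf-8")

# Check whether the converted bytes are still a usable zlib stream.
try:
    zlib.decompress(mangled)
    still_decompresses = True
except zlib.error:
    still_decompresses = False

print("first UTF-8 error at byte:", first_error)
print("bytes unchanged:", mangled == compressed)
print("still decompresses:", still_decompresses)
```

Whether the conversion stops at the first error or substitutes replacement
characters, the stored value no longer holds the original bytes, so the two
index sizes are not comparing like with like.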

Andi..
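
For reference, a sketch of the working round trip from the thread, split into
small helpers. The helper names (pack, unpack, add_compressed_field,
read_compressed_field) are hypothetical; the PyLucene calls are the ones quoted
in the original message. The two Lucene helpers assume PyLucene is installed
and that lucene.initVM() has already been called.

```python
import zlib


def pack(text):
    """Encode text as UTF-8 and compress it with zlib."""
    return zlib.compress(text.encode("utf-8"))


def unpack(data):
    """Inverse of pack(): decompress and decode back to str."""
    return zlib.decompress(data).decode("utf-8")


def add_compressed_field(doc, name, text):
    """Hypothetical helper: store text as a compressed binary StoredField.

    Imports are done lazily so this module loads without PyLucene.
    """
    from lucene import JArray
    from org.apache.lucene.document import StoredField

    # Wrapping in JArray('byte') makes Lucene see a Java byte[].
    doc.add(StoredField(name, JArray('byte')(pack(text))))


def read_compressed_field(doc, name):
    """Hypothetical helper: read the field back and decompress it."""
    ref = doc.getBinaryValue(name)  # BytesRef, as in the original message
    return unpack(ref.bytes.bytes_)
```

The pure-Python half can be checked without Lucene, for example
unpack(pack("some text")) should return "some text".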

> 
>> On Sat, Oct 26, 2024 at 7:03 PM Andi Vajda <va...@apache.org> wrote:
>> 
>> 
>>> On Oct 26, 2024, at 14:50, Prashant Saxena <animator...@gmail.com> wrote:
>>> 
>>> I just need to store compressed strings to save space. If it can be done in
>>> any other way, I'm OK with that.
>> 
>> The JArray('byte') is the way.
>> 
>> Andi..
>> 
>>> 
>>> 
>>>> On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:
>>>> 
>>>> 
>>>>> On Sat, 26 Oct 2024, Prashant Saxena wrote:
>>>>> 
>>>>> PyLucene 10.0.0
>>>>> 
>>>>> I'm trying to store a long text by compressing it first using zlib
>>>>> 
>>>>> *doc.add(StoredField("contents",
>>>>> zlib.compress(ftext.encode('utf-8'))))*
>>>>> 
>>>>> The resulting index size is *~83 MB*. When reading its value back using
>>>>> 
>>>>> *c = doc.getBinaryValue("contents")*
>>>>> 
>>>>> It's returning 'NoneType' and when using
>>>>> 
>>>>> *c = doc.get("contents")*
>>>>> 
>>>>> It's returning a string which cannot be decompressed.
>>>>> 
>>>>> When using
>>>>> 
>>>>> *doc.add(StoredField("contents",
>>>>> JArray('byte')(zlib.compress(ftext.encode('utf-8')))))*
>>>>> 
>>>>> The resulting index size is *~160 MB*. There is no problem in getting its
>>>>> value using
>>>>> 
>>>>> 
>>>>> 
>>>>> *c = doc.getBinaryValue("contents")*
>>>>> *cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')*
>>>>> 
>>>>> *Question 1:* Why does the index size almost double when using JArray?
>>>> 
>>>> Because the value you're passing is actually processed correctly?
>>>> 
>>>>> *Question 2:* How do you correctly create and store compressed binary data
>>>>> in a StoredField?
>>>> 
>>>> If you want a Python bytes object, like b'abcd', to be seen by Lucene (Java)
>>>> as a byte array, you should wrap it with a JArray('byte') like you did.
>>>> Otherwise, it's seen as a string (I need to double-check) and not handled
>>>> correctly.
>>>> 
>>>>> I am using PyLucene in my current project. Please advise me if I should
>>>>> post my questions on the java-user list instead of here.
>>>> 
>>>> This particular question is specific to PyLucene and should be asked here,
>>>> like you did ;-)
>>>> 
>>>> Andi..
>>>> 
>> 
>> 
