Re: Store byte array in StoredField using zlib compression

Andi Vajda Sun, 27 Oct 2024 00:57:16 -0700


On Sun, 27 Oct 2024, Prashant Saxena wrote:

OK, everything about the original problem has been cleared up. Please let me
know how to get this to work:

*from org.apache.lucene.codecs.lucene100 import Lucene100Codec*
*print(Lucene100Codec.Mode.BEST_COMPRESSION)*


Strange, it works for me:
  >>> import lucene
  >>> lucene.initVM()
  <jcc.JCCEnv object at 0x10317caf0>
  >>> from org.apache.lucene.codecs.lucene100 import Lucene100Codec
  >>> Lucene100Codec.Mode
  <class 'org.apache.lucene.codecs.lucene100.Lucene100Codec$Mode'>
  >>> dir(Lucene100Codec.Mode)

['BEST_COMPRESSION', 'BEST_SPEED', 'EnumDesc', '__class__', '__delattr__','__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__','__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__','__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__','__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__','__subclasshook__', '_jobject', 'boxfn_', 'cast_', 'class', 'class_','compareTo', 'declaringClass', 'describeConstable', 'equals', 'getClass','getDeclaringClass', 'hashCode', 'instance_', 'name', 'notify', 'notifyAll','of_', 'ordinal', 'parameters_', 'toString', 'valueOf', 'values', 'wait','wrapfn_']

  >>> Lucene100Codec.Mode.BEST_COMPRESSION
  <Lucene100Codec$Mode: BEST_COMPRESSION>
  >>> print(Lucene100Codec.Mode.BEST_COMPRESSION)
  BEST_COMPRESSION

Andi..


Error

AttributeError: type object 'Lucene100Codec$Mode' has no attribute 'BEST_COMPRESSION'

I need it here:

config = IndexWriterConfig(analyzer)
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
config.setCodec(Lucene100Codec(Lucene100Codec.Mode.BEST_COMPRESSION))

Prashant

On Sat, Oct 26, 2024 at 8:13 PM Andi Vajda <va...@apache.org> wrote:

On Oct 26, 2024, at 16:21, Prashant Saxena <animator...@gmail.com> wrote:


There must be an explanation for 83 MB of compressed data almost doubling
in size. It doesn't make sense at all.


When not using a JArray('byte'), your Python byte array is converted into a
partial Java string and is corrupted, probably at the first UTF-8
conversion error. I didn't actually verify this, as I'm not near my computer,
but you're comparing a working solution with a non-working one 😊

Andi..
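
A minimal sketch of why compressed bytes are unlikely to survive being
treated as text. It uses only the Python standard library and does not
reproduce what PyLucene/JCC does internally, which the reply above says was
not verified. The helper name `first_utf8_error` and the sample text are
hypothetical, chosen for illustration.

```python
import zlib


def first_utf8_error(data: bytes):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Hypothetical sample text standing in for the poster's `ftext`.
ftext = "some long text to index " * 1000
raw = ftext.encode("utf-8")
compressed = zlib.compress(raw, 6)  # level 6 is zlib's usual default

print(first_utf8_error(raw))         # None: plain text is valid UTF-8
print(first_utf8_error(compressed))  # an offset near the start of the stream

# Kept as bytes, the payload round-trips exactly.
assert zlib.decompress(compressed).decode("utf-8") == ftext
```

If a string conversion stops or substitutes characters at that offset, the
stored value is no longer a valid zlib stream, which would be consistent
with both the smaller index and the failed decompression.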

On Sat, Oct 26, 2024 at 7:03 PM Andi Vajda <va...@apache.org> wrote:

On Oct 26, 2024, at 14:50, Prashant Saxena <animator...@gmail.com> wrote:


I just need to store compressed strings to save space. If it can be done in
any other way, I'm OK with that.


The JArray('byte') is the way.

Andi..

On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:

On Sat, 26 Oct 2024, Prashant Saxena wrote:

PyLucene 10.0.0

I'm trying to store a long text by compressing it first using zlib

*doc.add(StoredField("contents", zlib.compress(ftext.encode('utf-8'))))*


The resulting index size is *~83 MB*. When reading its value back using


*c = doc.getBinaryValue("contents")*

It's returning 'NoneType' and when using

*c = doc.get("contents")*

It's returning a string which cannot be decompressed.

When using

*doc.add(StoredField("contents",
JArray('byte')(zlib.compress(ftext.encode('utf-8')))))*

The resulting index size is *~160 MB*. There is no problem in getting its
value using

*c = doc.getBinaryValue("contents")*
*cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')*

*Question 1:* Why does the index size almost double when using JArray?


Because the value you're passing is actually processed correctly?

*Question 2:* How do you correctly create and store compressed binary data
in StoredField?


If you want a Python bytes object, like b'abcd', to be seen by Lucene (Java)
as a byte array, you should wrap it with a JArray('byte') like you did.
Otherwise, it's seen as a string (I need to double-check) and not handled
correctly.
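
The compress and decompress halves of this round trip can be sketched with
the standard library alone. The helper names `pack_text` and `unpack_text`
are hypothetical and not part of PyLucene; the PyLucene-specific wrapping is
left out so the sketch stays runnable without a JVM.

```python
import zlib


def pack_text(text: str) -> bytes:
    """Encode text as UTF-8 and compress it with zlib."""
    return zlib.compress(text.encode("utf-8"))


def unpack_text(payload: bytes) -> str:
    """Reverse pack_text: decompress, then decode as UTF-8."""
    return zlib.decompress(payload).decode("utf-8")


# Hypothetical sample text standing in for the poster's `ftext`.
ftext = "a long text that compresses well " * 500
payload = pack_text(ftext)

# The payload must stay a bytes value end to end for this to hold.
assert unpack_text(payload) == ftext
print(len(ftext.encode("utf-8")), len(payload))
```

In the thread's terms, the result of `pack_text(ftext)` is what would be
wrapped as `JArray('byte')(...)` before being passed to `StoredField`, and
the bytes read back through `doc.getBinaryValue("contents")` are what
`unpack_text` would receive.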

I am using PyLucene in my current project. Please advise me if I should
post my questions on the java-user list instead of here.


This particular question is specific to PyLucene and should be asked here,
like you did ;-)

Andi..
