On Sun, 27 Oct 2024, Prashant Saxena wrote:
OK, everything about that problem has been cleared up. Please let me know
how to get this to work:
*from org.apache.lucene.codecs.lucene100 import Lucene100Codec*
*print(Lucene100Codec.Mode.BEST_COMPRESSION)*
Strange, it works for me:
>>> import lucene
>>> lucene.initVM()
<jcc.JCCEnv object at 0x10317caf0>
>>> from org.apache.lucene.codecs.lucene100 import Lucene100Codec
>>> Lucene100Codec.Mode
<class 'org.apache.lucene.codecs.lucene100.Lucene100Codec$Mode'>
>>> dir(Lucene100Codec.Mode)
['BEST_COMPRESSION', 'BEST_SPEED', 'EnumDesc', '__class__', '__delattr__',
'__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
'__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__',
'__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
'__subclasshook__', '_jobject', 'boxfn_', 'cast_', 'class', 'class_',
'compareTo', 'declaringClass', 'describeConstable', 'equals', 'getClass',
'getDeclaringClass', 'hashCode', 'instance_', 'name', 'notify', 'notifyAll',
'of_', 'ordinal', 'parameters_', 'toString', 'valueOf', 'values', 'wait',
'wrapfn_']
>>> Lucene100Codec.Mode.BEST_COMPRESSION
<Lucene100Codec$Mode: BEST_COMPRESSION>
>>> print(Lucene100Codec.Mode.BEST_COMPRESSION)
BEST_COMPRESSION
Andi..
Error:
AttributeError: type object 'Lucene100Codec$Mode' has no attribute 'BEST_COMPRESSION'
I need it here:
config = IndexWriterConfig(analyzer)
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
config.setCodec(Lucene100Codec(Lucene100Codec.Mode.BEST_COMPRESSION))
Prashant
On Sat, Oct 26, 2024 at 8:13 PM Andi Vajda <va...@apache.org> wrote:
On Oct 26, 2024, at 16:21, Prashant Saxena <animator...@gmail.com> wrote:
There must be an explanation for 83 MB of compressed data almost doubling
in size. It doesn't make sense at all.
When not using a JArray('byte') your python byte array is converted into a
partial java string and is being corrupted, probably at the first utf-8
conversion error. I didn't actually verify this, I'm not near my computer
but you're comparing a working solution with a non-working one 😊
Andi..
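To illustrate the point above without a JVM, here is a minimal pure-Python sketch. It does not reproduce what JCC actually does when converting a Python bytes object to a Java string (that was not verified in this thread); it only shows that zlib output is not valid UTF-8, so any trip through text loses data. The sample text and variable names are made up for the illustration.

```python
import zlib

# Hypothetical sample text standing in for the real document contents.
text = "PyLucene stores this long text in a StoredField. " * 200
compressed = zlib.compress(text.encode("utf-8"))

# A zlib stream is arbitrary binary data, not UTF-8 text, so a strict
# decode raises instead of returning a string.
try:
    compressed.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# A lossy decode "succeeds", but the original bytes cannot be recovered
# from the resulting string.
lossy = compressed.decode("utf-8", errors="replace").encode("utf-8")
survives_text_round_trip = (lossy == compressed)

# Kept as bytes from end to end, the payload round-trips exactly.
restored = zlib.decompress(compressed).decode("utf-8")

print(strict_decode_failed, survives_text_round_trip, restored == text)
```

The smaller index is therefore not a fair baseline: whatever was stored in that case can no longer be turned back into the original text.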
On Sat, Oct 26, 2024 at 7:03 PM Andi Vajda <va...@apache.org> wrote:
On Oct 26, 2024, at 14:50, Prashant Saxena <animator...@gmail.com> wrote:
I just need to store compressed strings to save space. If it can be done
in any other way, I'm OK with that.
The JArray('byte') is the way.
Andi..
On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:
On Sat, 26 Oct 2024, Prashant Saxena wrote:
PyLucene 10.0.0
I'm trying to store a long text by compressing it first using zlib
*doc.add(StoredField("contents",
zlib.compress(ftext.encode('utf-8'))))*
The resulting index size is *~83 MB*. When reading its value back using
*c = doc.getBinaryValue("contents")*
it returns None, and when using
*c = doc.get("contents")*
it returns a string which cannot be decompressed.
When using
*doc.add(StoredField("contents",
JArray('byte')(zlib.compress(ftext.encode('utf-8')))))*
The resulting index size is *~160 MB*. There is no problem getting its
value back using
*c = doc.getBinaryValue("contents")*
*cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')*
*Question 1:* Why does the index size almost double when using JArray?
Because the value you're passing is actually processed correctly?
*Question 2:* How do you correctly create and store compressed binary data
in StoredField?
If you want a Python bytes object, like b'abcd', to be seen by Lucene (Java)
as a byte array, you should wrap it with a JArray('byte') like you did.
Otherwise, it's seen as a string (I need to double-check) and not handled
correctly.
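The wrapping described above can be collected into a few small helpers. This is only a sketch built from the calls quoted in this thread: the helper names are made up, `from lucene import JArray` is assumed to be how JArray is imported in your install, and the Lucene parts need `lucene.initVM()` to have been called first.

```python
import zlib


def compress_text(text):
    """Return the zlib-compressed UTF-8 encoding of text as Python bytes."""
    return zlib.compress(text.encode("utf-8"))


def decompress_text(payload):
    """Inverse of compress_text: compressed Python bytes back to str."""
    return zlib.decompress(payload).decode("utf-8")


def add_compressed_field(doc, name, text):
    """Add text to doc as a binary stored field (hypothetical helper)."""
    # Imported lazily so the pure-Python helpers above work without a JVM.
    # Assumption: JArray is importable from the lucene module.
    from lucene import JArray
    from org.apache.lucene.document import StoredField

    # Wrapping in JArray('byte') is what makes Lucene see a Java byte[]
    # rather than a string.
    doc.add(StoredField(name, JArray('byte')(compress_text(text))))


def read_compressed_field(doc, name):
    """Read the field back with the expression used in this thread."""
    ref = doc.getBinaryValue(name)
    # ref.bytes.bytes_ is taken verbatim from the message above; it assumes
    # the returned value spans the whole underlying array.
    return decompress_text(ref.bytes.bytes_)
```

With a Document in hand, `add_compressed_field(doc, "contents", ftext)` at index time and `read_compressed_field(doc, "contents")` at search time would mirror the two calls from the original message.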
I am using PyLucene in my current project. Please advise me if I should
post my questions on the java-user list instead of here.
This particular question is specific to PyLucene and should be asked here,
like you did ;-)
Andi..