On Fri, 2004-05-14 at 19:35, Dmitry Serebrennikov wrote:
> >
> > Sounds like a good plan. String-values remain as fast as they are,
> > and binary values are no slower. We can easily layer compression,
> > etc. on top of this.
> >
> > Are you volunteering?
>
> :)
> I'm pretty well pressed for time right now, so if someone else can pick
> this up it would probably get done sooner.
> Let me see how my weekend pans out.
Hi All,
I'm new here, so I'm not sure what the proper formalities for doing this
are, but I had some free time today and whipped up a patch that adds
binary value support to Field, based on what's already been discussed.
Since it's my first contribution ever, please forgive it if it's not 100%
perfect; maybe it will be of some use to Dmitry or anyone else who
was planning on or in the midst of implementing this.
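On the "layer compression on top" idea from earlier in the thread: the patch itself only stores raw bytes, so any compression would happen in application code before the value is handed to the field. Below is a minimal sketch of that step, using java.util.zip from the JDK. The class and method names are made up for illustration and are not part of the patch, and it uses a newer JDK API (StandardCharsets) for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of the patch: compresses a value before it
// would be stored as a binary field, and restores it after retrieval.
class BinaryValueCompression {

    /** Deflate-compresses the input and returns the compressed bytes. */
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates bytes produced by compress(); rejects truncated or corrupt input. */
    public static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and not finished means the input ended early
                // or needs a preset dictionary we do not have.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "test1 test1 test1 test1".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original);
        // With the patch applied, 'packed' is what would be passed to
        // Field.Binary(name, packed); that call is omitted here so the
        // sketch runs without Lucene on the classpath.
        System.out.println(Arrays.equals(original, decompress(packed)));
    }
}
```

With the patch applied, the compressed array would go into Field.Binary and come back out through getBinaryValues, with decompress applied by the caller.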
This is not extensively tested, and I was hoping for some guidance from
the other developers in this area. I modified the unit test for Document
to verify its operation -- are there any others that I should update to
fully test this addition? Are the unit tests sufficient, or should I go
to the extent of building a little app to test this and do some actual
searching?
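To make the testing question concrete, here is roughly what a round-trip check of the stored format would exercise. This is a standalone sketch that mimics only the part of the encoding visible in the FieldsWriter/FieldsReader changes in the patch: a flag byte with the value 2 OR'ed in, a VInt length, then the raw bytes. It uses in-memory JDK streams as stand-ins for Lucene's own stream classes and assumes the 7-bits-per-byte, low-order-first VInt encoding. All class and method names in it are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the stored-field encoding shown in the patch.
class StoredBinarySketch {
    static final int FLAG_TOKENIZED = 1; // existing bit in the flag byte
    static final int FLAG_BINARY = 2;    // bit the patch adds

    /** Writes an int as 7-bit groups, low-order first, high bit = "more follows". */
    public static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** Reads a value written by writeVInt; fails on truncated input. */
    public static int readVInt(ByteArrayInputStream in) {
        int value = 0;
        for (int shift = 0; shift <= 28; shift += 7) {
            int b = in.read();
            if (b < 0) throw new IllegalArgumentException("truncated VInt");
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
        }
        throw new IllegalArgumentException("VInt too long");
    }

    /** Flag byte, VInt length, then the raw bytes, in that order. */
    public static byte[] writeBinaryField(byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(FLAG_BINARY);
        writeVInt(out, value.length);
        out.write(value, 0, value.length);
        return out.toByteArray();
    }

    /** Reverses writeBinaryField; rejects entries without the binary flag. */
    public static byte[] readBinaryField(byte[] stored) {
        ByteArrayInputStream in = new ByteArrayInputStream(stored);
        int bits = in.read();
        if (bits < 0 || (bits & FLAG_BINARY) == 0)
            throw new IllegalArgumentException("not a binary field");
        int len = readVInt(in);
        if (len < 0) throw new IllegalArgumentException("negative length");
        byte[] value = new byte[len];
        if (len > 0 && in.read(value, 0, len) != len)
            throw new IllegalArgumentException("truncated value");
        return value;
    }

    public static void main(String[] args) {
        byte[] stored = writeBinaryField("test1".getBytes(StandardCharsets.US_ASCII));
        System.out.println(Arrays.toString(stored)); // [2, 5, 116, 101, 115, 116, 49]
        System.out.println(new String(readBinaryField(stored), StandardCharsets.US_ASCII));
    }
}
```

A check like this only covers the byte layout. A test that writes a document to a real index and reads it back would still be needed to cover the actual reader and writer code paths.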
At any rate, I hope this is useful to some degree. This patch was
made against today's HEAD. Should I be patching against tagged
releases?
Any critique is welcome.
Drew
Index: src/java/org/apache/lucene/document/Document.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/document/Document.java,v
retrieving revision 1.19
diff -c -r1.19 Document.java
*** src/java/org/apache/lucene/document/Document.java 21 Apr 2004 17:08:04 -0000 1.19
--- src/java/org/apache/lucene/document/Document.java 15 May 2004 21:44:28 -0000
***************
*** 199,204 ****
--- 199,224 ----
return values;
}
+ /**
+ * Returns an array of values of the fields specified as the method parameter.
+ * This method can return <code>null</code>. It is also possible
+ * that the first dimension array of the two dimensional array can contain
+ * nulls if fields do not have binary values.
+ *
+ * @param name the name of the field
+ * @return a <code>byte[][]</code> of binary field values.
+ */
+ public final byte[][] getBinaryValues(String name) {
+ Field[] namedFields = getFields(name);
+ if (namedFields == null)
+ return null;
+ byte[][] values = new byte[namedFields.length][];
+ for (int i = 0; i < namedFields.length; i++) {
+ values[i] = namedFields[i].binaryValue();
+ }
+ return values;
+ }
+
/** Prints the fields of a document for human consumption. */
public final String toString() {
StringBuffer buffer = new StringBuffer();
Index: src/java/org/apache/lucene/document/Field.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/document/Field.java,v
retrieving revision 1.14
diff -c -r1.14 Field.java
*** src/java/org/apache/lucene/document/Field.java 16 Apr 2004 09:48:25 -0000 1.14
--- src/java/org/apache/lucene/document/Field.java 15 May 2004 21:44:28 -0000
***************
*** 24,44 ****
/**
A field is a section of a Document. Each field has two parts, a name and a
! value. Values may be free text, provided as a String or as a Reader, or they
! may be atomic keywords, which are not further processed. Such keywords may
! be used to represent dates, urls, etc. Fields are optionally stored in the
! index, so that they may be returned with hits on the document.
*/
public final class Field implements java.io.Serializable {
private String name = "body";
private String stringValue = null;
private boolean storeTermVector = false;
private Reader readerValue = null;
private boolean isStored = false;
private boolean isIndexed = true;
private boolean isTokenized = true;
!
private float boost = 1.0f;
/** Sets the boost factor hits on this field. This value will be
--- 24,48 ----
/**
A field is a section of a Document. Each field has two parts, a name and a
! value. Values may be free text, provided as a String or a Reader, or they
! may be atomic keywords which are not further processed. Such keywords may
! be used to represent dates, urls, etc. Fields may also store binary values
! which can be used to store compressed data in the index. Fields are
! optionally stored in the index, so that they may be returned with hits
! on the document. Binary fields are always stored in the index.
*/
public final class Field implements java.io.Serializable {
private String name = "body";
private String stringValue = null;
+ private byte[] binaryValue = null;
private boolean storeTermVector = false;
private Reader readerValue = null;
private boolean isStored = false;
private boolean isIndexed = true;
private boolean isTokenized = true;
! private boolean isBinary = false;
!
private float boost = 1.0f;
/** Sets the boost factor hits on this field. This value will be
***************
*** 136,154 ****
f.storeTermVector = storeTermVector;
return f;
}
!
/** The name of the field (e.g., "date", "subject", "title", or "body")
as an interned string. */
public String name() { return name; }
! /** The value of the field as a String, or null. If null, the Reader value
! is used. Exactly one of stringValue() and readerValue() must be set. */
! public String stringValue() { return stringValue; }
! /** The value of the field as a Reader, or null. If null, the String value
! is used. Exactly one of stringValue() and readerValue() must be set. */
public Reader readerValue() { return readerValue; }
!
/** Create a field by specifying all parameters except for <code>storeTermVector</code>,
* which is set to <code>false</code>.
*/
--- 140,171 ----
f.storeTermVector = storeTermVector;
return f;
}
!
! /** Constructs a Binary-valued field that is neither tokenized nor indexed, but is
! stored in the index verbatim. Useful for storing compressed data in the
! index, for return with hits. */
! public static final Field Binary(String name, byte[] value) {
! return new Field(name, value);
! }
!
/** The name of the field (e.g., "date", "subject", "title", or "body")
as an interned string. */
public String name() { return name; }
! /** The value of the field as a String, or null. If null, the Reader or
! Binary value is used. Exactly one of stringValue(), readerValue() and
! binaryValue() must be set. */
! public String stringValue() { return stringValue; }
! /** The value of the field as a Reader, or null. If null, the String or
! Binary value is used. Exactly one of stringValue(), readerValue() and
! binaryValue() must be set. */
public Reader readerValue() { return readerValue; }
+ /** The value of the field as a byte array, or null. If null, the Reader or
+ String value is used. Exactly one of stringValue(), readerValue() and
+ binaryValue() must be set. */
+ public byte[] binaryValue() { return binaryValue; }
!
/** Create a field by specifying all parameters except for <code>storeTermVector</code>,
* which is set to <code>false</code>.
*/
***************
*** 193,198 ****
--- 210,230 ----
this.readerValue = reader;
}
+ Field(String name, byte[] value) {
+ if (name == null)
+ throw new IllegalArgumentException("name cannot be null");
+ if (value == null)
+ throw new IllegalArgumentException("value cannot be null");
+
+ this.name = name.intern();
+ this.binaryValue = value;
+
+ this.isBinary = true;
+ this.isStored = true;
+ this.isIndexed = false;
+ this.isTokenized = false;
+ }
+
/** True iff the value of the field is to be stored in the index for return
with search hits. It is an error for this to be true if a field is
Reader-valued. */
***************
*** 207,212 ****
--- 239,247 ----
Reader-valued. */
public final boolean isTokenized() { return isTokenized; }
+ /** True iff the value of the field is stored as binary. */
+ public final boolean isBinary() { return isBinary; }
+
/** True iff the term or terms used to index this field are stored as a term
* vector, available from {@link IndexReader#getTermFreqVector(int,String)}.
* These methods do not provide access to the original content of the field,
***************
*** 221,226 ****
--- 256,263 ----
public final String toString() {
if (isStored && isIndexed && !isTokenized)
return "Keyword<" + name + ":" + stringValue + ">";
+ else if (isBinary)
+ return "Binary<" + name + ">";
else if (isStored && !isIndexed && !isTokenized)
return "Unindexed<" + name + ":" + stringValue + ">";
else if (isStored && isIndexed && isTokenized && stringValue!=null)
***************
*** 228,240 ****
else if (!isStored && isIndexed && isTokenized && readerValue!=null)
return "Text<" + name + ":" + readerValue + ">";
else if (!isStored && isIndexed && isTokenized)
- {
return "UnStored<" + name + ">";
- }
else
- {
return super.toString();
- }
}
}
--- 265,273 ----
Index: src/java/org/apache/lucene/index/FieldsReader.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/FieldsReader.java,v
retrieving revision 1.7
diff -c -r1.7 FieldsReader.java
*** src/java/org/apache/lucene/index/FieldsReader.java 29 Mar 2004 22:48:02 -0000 1.7
--- src/java/org/apache/lucene/index/FieldsReader.java 15 May 2004 21:44:28 -0000
***************
*** 67,77 ****
byte bits = fieldsStream.readByte();
! doc.add(new Field(fi.name, // name
! fieldsStream.readString(), // read value
! true, // stored
! fi.isIndexed, // indexed
! (bits & 1) != 0, fi.storeTermVector)); // vector
}
return doc;
--- 67,83 ----
byte bits = fieldsStream.readByte();
! if ((bits & 2) != 0) {
! final byte[] b = new byte[fieldsStream.readVInt()];
! fieldsStream.readBytes(b, 0, b.length);
! doc.add(Field.Binary(fi.name, b));
! }
! else
! doc.add(new Field(fi.name, // name
! fieldsStream.readString(), // read value
! true, // stored
! fi.isIndexed, // indexed
! (bits & 1) != 0, fi.storeTermVector)); // vector
}
return doc;
Index: src/java/org/apache/lucene/index/FieldsWriter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/FieldsWriter.java,v
retrieving revision 1.3
diff -c -r1.3 FieldsWriter.java
*** src/java/org/apache/lucene/index/FieldsWriter.java 29 Mar 2004 22:48:02 -0000 1.3
--- src/java/org/apache/lucene/index/FieldsWriter.java 15 May 2004 21:44:28 -0000
***************
*** 62,70 ****
byte bits = 0;
if (field.isTokenized())
bits |= 1;
fieldsStream.writeByte(bits);
! fieldsStream.writeString(field.stringValue());
}
}
}
--- 62,80 ----
byte bits = 0;
if (field.isTokenized())
bits |= 1;
+
+ if (field.isBinary())
+ bits |= 2;
+
fieldsStream.writeByte(bits);
! if (field.isBinary()) {
! final int len = field.binaryValue().length;
! fieldsStream.writeVInt(len);
! fieldsStream.writeBytes(field.binaryValue(), len);
! }
! else
! fieldsStream.writeString(field.stringValue());
}
}
}
Index: src/test/org/apache/lucene/document/TestDocument.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/document/TestDocument.java,v
retrieving revision 1.4
diff -c -r1.4 TestDocument.java
*** src/test/org/apache/lucene/document/TestDocument.java 20 Apr 2004 17:26:16 -0000 1.4
--- src/test/org/apache/lucene/document/TestDocument.java 15 May 2004 21:44:35 -0000
***************
*** 50,55 ****
--- 50,57 ----
public void testRemoveForNewDocument() throws Exception
{
Document doc = makeDocumentWithFields();
+ assertEquals(10, doc.fields.size());
+ doc.removeFields("binary");
assertEquals(8, doc.fields.size());
doc.removeFields("keyword");
assertEquals(6, doc.fields.size());
***************
*** 131,136 ****
--- 133,140 ----
doc.add(Field.UnIndexed("unindexed", "test2"));
doc.add(Field.UnStored( "unstored", "test1"));
doc.add(Field.UnStored( "unstored", "test2"));
+ doc.add(Field.Binary( "binary" , "test1".getBytes()));
+ doc.add(Field.Binary( "binary" , "test2".getBytes()));
return doc;
}
***************
*** 140,149 ****
String[] textFieldValues = doc.getValues("text");
String[] unindexedFieldValues = doc.getValues("unindexed");
String[] unstoredFieldValues = doc.getValues("unstored");
!
assertTrue(keywordFieldValues.length == 2);
assertTrue(textFieldValues.length == 2);
assertTrue(unindexedFieldValues.length == 2);
// this test cannot work for documents retrieved from the index
// since unstored fields will obviously not be returned
if (! fromIndex)
--- 144,155 ----
String[] textFieldValues = doc.getValues("text");
String[] unindexedFieldValues = doc.getValues("unindexed");
String[] unstoredFieldValues = doc.getValues("unstored");
! byte[][] binaryFieldValues = doc.getBinaryValues("binary");
!
assertTrue(keywordFieldValues.length == 2);
assertTrue(textFieldValues.length == 2);
assertTrue(unindexedFieldValues.length == 2);
+ assertTrue(binaryFieldValues.length == 2);
// this test cannot work for documents retrieved from the index
// since unstored fields will obviously not be returned
if (! fromIndex)
***************
*** 157,162 ****
--- 163,170 ----
assertTrue(textFieldValues[1].equals("test2"));
assertTrue(unindexedFieldValues[0].equals("test1"));
assertTrue(unindexedFieldValues[1].equals("test2"));
+ assertTrue(new String(binaryFieldValues[0]).equals("test1"));
+ assertTrue(new String(binaryFieldValues[1]).equals("test2"));
// this test cannot work for documents retrieved from the index
// since unstored fields will obviously not be returned
if (! fromIndex)