[jira] [Commented] (LUCENE-7171) IndexableField changes its IndexableFieldType when the index is re-opened for reading

2016-05-23 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296595#comment-15296595
 ] 

Michael McCandless commented on LUCENE-7171:


This is indeed irritating, but it is a long-standing issue in Lucene: it does 
not store all field attributes (such as the "was this field tokenized?" 
boolean), so when it loads the document it "guesses" them (incorrectly in your 
case).
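
The "guessing" can be pictured with a toy model. The sketch below uses no Lucene classes at all: {{ToyField}}, {{store}} and {{load}} are hypothetical names invented for illustration. It only assumes what the comment states, namely that the stored form keeps the field value but not the tokenized flag, so the loader has to fall back to a fixed default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these types are Lucene classes.
public class StoredFieldRoundTrip {

    /** Hypothetical stand-in for a field: a value plus one type attribute. */
    public static final class ToyField {
        public final String value;
        public final boolean tokenized;

        public ToyField(String value, boolean tokenized) {
            this.value = value;
            this.tokenized = tokenized;
        }
    }

    /** "Writes" a document: only name -> value survives, the flag is dropped. */
    public static Map<String, String> store(Map<String, ToyField> doc) {
        Map<String, String> stored = new LinkedHashMap<>();
        for (Map.Entry<String, ToyField> e : doc.entrySet()) {
            stored.put(e.getKey(), e.getValue().value);
        }
        return stored;
    }

    /** "Loads" a document: the flag was never persisted, so a default is assumed. */
    public static Map<String, ToyField> load(Map<String, String> stored) {
        Map<String, ToyField> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            // This is the "guess": every loaded field is treated as tokenized.
            doc.put(e.getKey(), new ToyField(e.getValue(), true));
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, ToyField> original = new LinkedHashMap<>();
        original.put("isbn", new ToyField("9900333X", false));

        Map<String, ToyField> reloaded = load(store(original));

        System.out.println("before: tokenized=" + original.get("isbn").tokenized);
        System.out.println("after:  tokenized=" + reloaded.get("isbn").tokenized);
    }
}
```

The value survives the round trip unchanged while the flag flips from false to true, which is the same shape as the two output lines in the issue description.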

We tried to fix this before, in LUCENE-3312, which introduced a different 
document class ({{StoredDocument}}) at search time to make it strongly typed so 
that it was clear Lucene would not store these attributes.

But that proved problematic and we eventually reverted the change in 
LUCENE-6971 and now we are back in the trappy state.

> IndexableField changes its IndexableFieldType when the index is re-opened for reading
> -------------------------------------------------------------------------------------
>
> Key: LUCENE-7171
> URL: https://issues.apache.org/jira/browse/LUCENE-7171
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 5.5
> Reporter: Roberto Cornacchia
>
> This code:
> {code}
> /* Store one document into an index */
> Directory index = new RAMDirectory();
> IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
> IndexWriter w = new IndexWriter(index, config);
> Document d1 = new Document();
> d1.add(new StringField("isbn", "9900333X", Field.Store.YES));
> w.addDocument(d1);
> w.commit();
> w.close();
> /* inspect IndexableFieldType */
> IndexableField f1 = d1.getField("isbn");
> System.err.println("FieldType for " + f1.stringValue() + " : " + f1.fieldType());
> /* retrieve all documents and inspect IndexableFieldType */
> IndexSearcher s = new IndexSearcher(DirectoryReader.open(index));
> TopDocs td = s.search(new MatchAllDocsQuery(), 1);
> for (ScoreDoc sd : td.scoreDocs) {
>   Document d2 = s.doc(sd.doc);
>   IndexableField f2 = d2.getField("isbn");
>   System.err.println("FieldType for " + f2.stringValue() + " : " + f2.fieldType());
> }
> {code}
> Produces:
> {code}
> FieldType for 9900333X : stored,indexed,omitNorms,indexOptions=DOCS
> FieldType for 9900333X : stored,indexed,tokenized,omitNorms,indexOptions=DOCS
> {code}
> The {{StringField}} field {{isbn}} is not tokenized, as correctly reported by 
> the first output, which happens right after closing the writer.
> However, it becomes tokenized when the index is re-opened with a new reader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7171) IndexableField changes its IndexableFieldType when the index is re-opened for reading

2016-05-23 Thread Roberto Cornacchia (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296289#comment-15296289
 ] 

Roberto Cornacchia commented on LUCENE-7171:


Perhaps I can reformulate this more concisely as:

Why, in {{DocumentStoredFieldVisitor}}, is {{StringField}} arbitrarily 
converted into {{TextField}}? What is the point of having them as different 
classes if they are swapped under the hood?

This looks like a quick workaround for the fact that {{StoredFieldVisitor}} 
has no {{textField()}} method.

{code}
  @Override
  public void stringField(FieldInfo fieldInfo, byte[] value) throws IOException {
    final FieldType ft = new FieldType(TextField.TYPE_STORED);
    ft.setStoreTermVectors(fieldInfo.hasVectors());
    ft.setOmitNorms(fieldInfo.omitsNorms());
    ft.setIndexOptions(fieldInfo.getIndexOptions());
    doc.add(new Field(fieldInfo.name, new String(value, StandardCharsets.UTF_8), ft));
  }
{code}





[jira] [Commented] (LUCENE-7171) IndexableField changes its IndexableFieldType when the index is re-opened for reading

2016-04-04 Thread Roberto Cornacchia (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224436#comment-15224436
 ] 

Roberto Cornacchia commented on LUCENE-7171:


I've been pointed at this bit of documentation for {{IndexReader.document(int 
docID)}}:
{quote}
NOTE: only the content of a field is returned, if that field was stored during 
indexing. Metadata like boost, omitNorm, IndexOptions, tokenized, etc., are not 
preserved.
{quote}

This explains what I've reported. But I find it hard not to consider this a 
design flaw. 

If I take the retrieved document and store it into a new index, I would expect 
this document to be the same as the one stored in the first index. It doesn't 
matter where it's stored. Those properties are defined for the fields of that 
document, not for a particular index. 
However, if I now try to retrieve that same document from the second index (by 
an exact match on its isbn), it won't be found, because {{isbn}} has been 
tokenized. This is surely not intended, is it?
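
The failed lookup can be reproduced in miniature without Lucene. The sketch below is a toy model, not Lucene's API: {{indexTerms}} and {{exactMatch}} are hypothetical helpers, and the "analyzer" is assumed to split on non-alphanumeric characters and lowercase each token, which is roughly what {{StandardAnalyzer}} does to a value like this one.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: none of these methods are Lucene APIs.
public class ExactMatchAfterReindex {

    /** Terms that end up in the index for one field value. */
    public static Set<String> indexTerms(String value, boolean tokenized) {
        Set<String> terms = new HashSet<>();
        if (!tokenized) {
            // Untokenized: the whole value is indexed verbatim as a single term.
            terms.add(value);
            return terms;
        }
        // Tokenized (assumed analyzer): split on non-alphanumerics, then lowercase.
        for (String token : value.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** An exact-match lookup succeeds only if the query equals an indexed term. */
    public static boolean exactMatch(Set<String> terms, String query) {
        return terms.contains(query);
    }

    public static void main(String[] args) {
        String isbn = "9900333X";

        // First index: the field is added as untokenized.
        Set<String> firstIndex = indexTerms(isbn, false);
        // Second index: the reloaded field now claims to be tokenized.
        Set<String> secondIndex = indexTerms(isbn, true);

        System.out.println("first index:  " + exactMatch(firstIndex, isbn));  // true
        System.out.println("second index: " + exactMatch(secondIndex, isbn)); // false
    }
}
```

In this model the second index holds the term {{9900333x}}, so an exact lookup for {{9900333X}} misses even though the stored value is identical in both indexes.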

