Hello,

I am observing strange behavior in CLucene with a large data set (though it's
not that large).

I have 40,000 HTML documents (around 5 GB of data). I added these documents
to a Lucene index. When I search for a word in this index, it returns
zero results.

If I take a subset of these documents (only 170 documents) and create an index
from it, the same search works.

Note that I used the same code to create both indexes.

Here is what I am doing to add a string to the index. (Note that I am passing
the document contents as a string.)

void LuceneLib::AddStringToDoc(Document *doc, const char *fieldName,
                               const char *str)
{
    // Convert the field name and the contents to wide strings.
    wchar_t *wstr = charToWChar(fieldName);
    wchar_t *wstr2 = charToWChar(str);

    bool isHighlighted = false;
    bool isStoreCompressed = false;

    // Is this field configured for highlighting?
    for (size_t i = 0; i < highlightedFields.size(); i++)
    {
        if (highlightedFields.at(i).compare(fieldName) == 0) {
            isHighlighted = true;
            break;
        }
    }

    // Is this field configured for compressed storage?
    for (size_t i = 0; i < compressedFields.size(); i++)
    {
        if (compressedFields.at(i).compare(fieldName) == 0) {
            isStoreCompressed = true;
            break;
        }
    }

    // Every field is tokenized; the other flags depend on the lookups above.
    cout << "Field : " << fieldName << " ";
    int fieldConfig = Field::INDEX_TOKENIZED;

    if (isHighlighted) {
        fieldConfig = fieldConfig | Field::TERMVECTOR_WITH_POSITIONS_OFFSETS;
        cout << " Highlighted";
    }

    if (isStoreCompressed) {
        fieldConfig = fieldConfig | Field::STORE_COMPRESS;
        cout << " Store Compressed";
    }
    else {
        fieldConfig = fieldConfig | Field::STORE_NO;
        cout << " Do not store";
    }
    cout << " : " << fieldConfig << endl;

    Field *field = _CLNEW Field((const TCHAR *) wstr, (const TCHAR *) wstr2,
                                fieldConfig);
    doc->add(*field);

    // Free the temporary wide-string copies.
    delete[] wstr;
    delete[] wstr2;
}
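
The charToWChar helper used above is not shown here. For readers who want to reproduce the setup, a helper of that shape could look roughly like the sketch below. This is only an illustration, not necessarily the actual implementation: the name charToWCharSketch is hypothetical, and it assumes the input is encoded in the current C locale's multibyte encoding and that the caller frees the result with delete[].

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for charToWChar (the real helper is not shown).
// Assumes `src` uses the current C locale's multibyte encoding.
// The caller owns the returned buffer and must release it with delete[].
wchar_t *charToWCharSketch(const char *src)
{
    assert(src != nullptr);

    // A multibyte string never yields more wide characters than it has
    // bytes, so strlen(src) + 1 elements always leaves room for the
    // terminator.
    const std::size_t capacity = std::strlen(src) + 1;
    wchar_t *dst = new wchar_t[capacity];

    const std::size_t written = std::mbstowcs(dst, src, capacity);
    if (written == static_cast<std::size_t>(-1)) {
        // Invalid byte sequence for the current locale: report failure.
        delete[] dst;
        return nullptr;
    }

    // mbstowcs writes the terminator when it fits (it does here);
    // terminate explicitly anyway as a safety net.
    dst[written < capacity ? written : capacity - 1] = L'\0';
    return dst;
}
```

A helper like this returns a null pointer when the input contains bytes that are not valid in the current locale, so the caller would need to check the result before building a Field from it.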


I checked the field config values, and they are as follows:
Field : docName  Do not store : 34
Field : docPath  Do not store : 34
Field : docContent  Highlighted Store Compressed : 3620
Field : All  Do not store : 34
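
To make the printed numbers easier to read, here is a small stand-alone sketch that splits a config value into its individual bits. The bit values and the k* names are assumptions inferred from the printed 34 and 3620 (34 = 32 + 2, 3620 = 2048 + 1024 + 512 + 32 + 4); they are not copied from the CLucene headers, so please check them against Field.h before relying on them.

```cpp
#include <cassert>
#include <string>

// Assumed bit values, inferred from the printed 34 and 3620.
// The k* names are hypothetical; verify the real values in Field.h.
enum AssumedFieldBits {
    kStoreNo        = 2,
    kStoreCompress  = 4,
    kIndexTokenized = 32,
    kTermVectorYes  = 512,
    kWithPositions  = 1024,
    kWithOffsets    = 2048
};

// Append `name` to `out` (separated by " | ") when `bit` is set in `config`.
static void appendIfSet(std::string &out, int config, int bit,
                        const char *name)
{
    if ((config & bit) == 0) return;
    if (!out.empty()) out += " | ";
    out += name;
}

// Return a readable list of the assumed flags that are set in `config`.
std::string describeFieldConfig(int config)
{
    assert(config >= 0);
    std::string out;
    appendIfSet(out, config, kStoreNo,        "STORE_NO");
    appendIfSet(out, config, kStoreCompress,  "STORE_COMPRESS");
    appendIfSet(out, config, kIndexTokenized, "INDEX_TOKENIZED");
    appendIfSet(out, config, kTermVectorYes,  "TERMVECTOR_YES");
    appendIfSet(out, config, kWithPositions,  "WITH_POSITIONS");
    appendIfSet(out, config, kWithOffsets,    "WITH_OFFSETS");
    return out;
}
```

Under these assumptions, 34 decodes to STORE_NO | INDEX_TOKENIZED, and 3620 decodes to STORE_COMPRESS | INDEX_TOKENIZED | TERMVECTOR_YES | WITH_POSITIONS | WITH_OFFSETS, so the printed values are consistent with the flags the code is meant to set.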


The field I am querying is docContent.

Please let me know if I have missed anything.

Thanks,
  Shailesh
_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers
