Sort, Search & Facets

2014-07-07 Thread Sandeep Khanzode
Hi,
 
I am using Lucene 4.7.2, and my primary use case for Lucene is to do three 
things: (a) search, (b) sort the search results by a number of fields, and 
(c) facet on roughly the same number of fields (probably the most standard use 
cases anyway).

Let us say I have a corpus of more than 100M docs, with each document having 
approx. 10-15 fields excluding the content (body), which will also be one of the 
fields. Of those 10-15 fields, I need sorting enabled on all of them, and 
faceting as well. That makes a total of approx. 45 fields to be indexed for 
various reasons: each logical field once as a String/Long/TextField, once as a 
SortedDocValuesField, and once as a FacetField.
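
For concreteness, this is roughly what one such logical field looks like when it
is added all three ways in 4.7 (the field name, the value, and the writer/taxonomy/
config objects below are placeholders, not my actual schema):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.util.BytesRef;

// Sketch only: assumes an IndexWriter (writer), a TaxonomyWriter (taxoWriter)
// and a FacetsConfig (config) are already set up.
Document doc = new Document();
doc.add(new StringField("brand", "acme", Field.Store.YES));        // for searching
doc.add(new SortedDocValuesField("brand", new BytesRef("acme")));   // for sorting
doc.add(new FacetField("brand", "acme"));                           // for faceting
writer.addDocument(config.build(taxoWriter, doc));                  // FacetField requires FacetsConfig.build()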

What will be the impact of this on the indexing operation w.r.t. the time taken 
as well as the extra disk space required? Will it grow linearly with the 
increase in the number of fields?

What is the impact on the memory usage during search time?


I will attempt to benchmark some of these, but if you have any experience with 
this, I would request you to share the details. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

Why hit is 0 for bigrams?

2014-07-07 Thread Manjula Wijewickrema
Hi,

I tried to index bigrams from a document, and the system gave me the following
output with the frequencies of the bigrams (output 1):

array size:15
array terms are:{contents: /1, assist librarian/1, assist manjula/2, assist
sabaragamuwa/1, fine manjula/1, librari manjula/1, librarian
sabaragamuwa/1, main librari/2, manjula assist/4, manjula fine/1, manjula
name/1, name manjula/1, sabaragamuwa univers/3, univers main/2, univers
sabaragamuwa/1}

For this I used the following code in the createIndex() class:


ShingleAnalyzerWrapper sw = new ShingleAnalyzerWrapper(analyzer, 2);
sw.setOutputUnigrams(false);
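
For reference, a wrapper configured like this is then handed to the IndexWriter
when the index is built. The following is only a rough sketch using the pre-4.0
API that the Hits-based search below implies; directory, file and FIELD_CONTENTS
are assumed from the rest of the code:

import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter(directory, sw, true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field(FIELD_CONTENTS, new FileReader(file)));  // contents get tokenized into bigram shingles
writer.addDocument(doc);
writer.optimize();
writer.close();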



Then I tried to search the indexed bigrams of the same document using the
following code in the searchIndex() class:


IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

Analyzer analyzer = new WhitespaceAnalyzer();

QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
Query query = queryParser.parse(terms[pos[freqs.length-q1]]);
System.out.println("Query: " + query);

Hits hits = indexSearcher.search(query);
System.out.println("Number of hits: " + hits.length());




For this, the system gave me the following output (output 2):


Query: contents:manjula contents:assist

Number of hits: 0

Query: contents:sabaragamuwa contents:univers

Number of hits: 0

Query: contents:univers contents:main

Number of hits: 0

Query: contents:main contents:librari

Number of hits: 0


Could someone please explain:


(1) Why is 'contents: /1' included in the array as an array element? (output 1)

(2) Why does the system return the query as 'contents:manjula
contents:assist' instead of 'manjula assist'? (output 2)

(3) Why is the number of hits 0 instead of their frequencies? (output 2)


I highly appreciate your kind reply.


Manjula.


Re: How to handle words that stem to stop words

2014-07-07 Thread David Murgatroyd
Arjen,

An approach requiring less list maintenance could be more advanced
linguistic processing to distinguish the stop word from the content word,
such as lemmatization rather than stemming.

A commercial offering, Rosette Search Essentials from Basis
(full disclosure: my employer), is free for development use and can be
downloaded via that link. It uses textual context to disambiguate lemmas, as in
the screenshot below -- compare the lemma for token #13 (van) vs. token #25
(vans). (I don't read/write Dutch; I took these snippets from the web.) The work
integrating OpenNLP might also prove helpful.

Best,
David Murgatroyd
www.linkedin.com/in/dmurga/

[image: Inline image 1]


Re: How to handle words that stem to stop words

2014-07-07 Thread Sujit Pal
Hi Arjen,

You could also mark a token as "keyword" so the stemmer passes it through
unchanged. For example, per the Javadocs for PorterStemFilter:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html

Note: This filter is aware of the KeywordAttribute. To prevent certain terms
from being passed to the stemmer, KeywordAttribute.isKeyword() should be set to
true in a previous TokenStream.
Note: For including the original term as well as the stemmed version, see
KeywordRepeatFilterFactory.

Assuming your stemmer is also keyword-attribute aware, you could build a filter
that reads a list of words (such as "vans") that should be protected from
stemming, marks them with the KeywordAttribute before they reach the Porter
stemmer, and put that filter into your analysis chain.
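
A rough sketch of what that chain could look like on 4.x, using the stock
SetKeywordMarkerFilter as the "list of words" filter (the whitespace tokenizer,
the Dutch Snowball stemmer and the one-word protected list below are only
placeholders for whatever your chain already uses):

import java.io.Reader;
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Words that must never be stemmed; "vans" is just the example from this thread.
    CharArraySet protectedWords =
        new CharArraySet(Version.LUCENE_47, Arrays.asList("vans"), true);
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_47, reader);
    TokenStream sink = new SetKeywordMarkerFilter(source, protectedWords); // sets the KeywordAttribute
    sink = new SnowballFilter(sink, "Dutch"); // keyword-marked tokens pass through unstemmed
    return new TokenStreamComponents(source, sink);
  }
};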

-sujit


On Mon, Jul 7, 2014 at 2:06 PM, Tri Cao  wrote:

> I think emitting two tokens for "vans" is the right (potentially only) way
> to do it. You could
> also control the dictionary of terms that require this special treatment.
>
> Is there any reason you're not happy with this approach?
>
> On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden <
> acmmail...@tweakers.net> wrote:
>
> Hello list,
>
> We have a fairly large Lucene database for a 30+ million post forum.
> Users post and search for all kinds of things. To make sure users don't
> have to type exact matches, we combine a WordDelimiterFilter with a
> (Dutch) SnowballFilter.
>
> Unfortunately users sometimes find examples of words that get stemmed to
> a word that's basically a stop word. Or, conversely, a very common
> word is stemmed so that it becomes the same as a rare word.
>
> We do index stop words, so theoretically they could still find their
> result. But when a rare word is stemmed in such a way that it yields a
> million hits, the results become pretty much unusable...
>
> One example is the Dutch word 'van' which is the equivalent of 'of' in
> English. A user tried to search for the shoe brand 'vans', which gets
> stemmed to 'van' and obviously gives useless results.
>
> I already noticed the 'KeywordRepeatFilter' to index/search both 'vans'
> and 'van' and the StemmerOverrideFilter to try and prevent these cases.
> Are there any other solutions for these kinds of problems?
>
> Best regards,
>
> Arjen van der Meijden
>


Re: How to handle words that stem to stop words

2014-07-07 Thread Jack Krupansky
Some of these anomalous cases are best handled by simply suppressing stemming:
use PatternKeywordMarkerFilter and SetKeywordMarkerFilter to set the keyword
attribute for matching tokens, and then most stemmers will not change them.


You can create a list of words to ignore, like plurals of your stop words, 
or possibly a pattern that matches stop words plus a short suffix that might 
get stemmed.
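
Roughly, for 4.x (the pattern below is purely illustrative -- a few Dutch stop
words plus an optional trailing "s" -- and "tokens" stands for whatever
TokenStream your analyzer already produces before the stemmer):

import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.PatternKeywordMarkerFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;

Pattern protect = Pattern.compile("(van|een|het)s?");                 // illustrative pattern only
TokenStream stream = new PatternKeywordMarkerFilter(tokens, protect); // keyword-marks matching tokens
stream = new SnowballFilter(stream, "Dutch");                         // keyword-marked tokens are left unstemmed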


-- Jack Krupansky

-----Original Message-----
From: Arjen van der Meijden

Sent: Sunday, July 6, 2014 2:47 PM
To: java-user@lucene.apache.org
Subject: How to handle words that stem to stop words

Hello list,

We have a fairly large Lucene database for a 30+ million post forum.
Users post and search for all kinds of things. To make sure users don't
have to type exact matches, we combine a WordDelimiterFilter with a
(Dutch) SnowballFilter.

Unfortunately users sometimes find examples of words that get stemmed to
a word that's basically a stop word. Or, conversely, a very common
word is stemmed so that it becomes the same as a rare word.

We do index stop words, so theoretically they could still find their
result. But when a rare word is stemmed in such a way that it yields a
million hits, the results become pretty much unusable...

One example is the Dutch word 'van' which is the equivalent of 'of' in
English. A user tried to search for the shoe brand 'vans', which gets
stemmed to 'van' and obviously gives useless results.

I already noticed the 'KeywordRepeatFilter' to index/search both 'vans'
and 'van' and the StemmerOverrideFilter to try and prevent these cases.
Are there any other solutions for these kinds of problems?

Best regards,

Arjen van der Meijden




Re: How to handle words that stem to stop words

2014-07-07 Thread Tri Cao

I think emitting two tokens for "vans" is the right (potentially only) way to 
do it. You could
also control the dictionary of terms that require this special treatment.

Is there any reason you're not happy with this approach?
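
In filter terms, the two-token approach is just the stock KeywordRepeatFilter in
front of the stemmer -- a sketch, where "tokens" stands for whatever your chain
produces before the stemmer:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;

TokenStream stream = new KeywordRepeatFilter(tokens);  // emits each token twice: keyword-marked + stemmable
stream = new SnowballFilter(stream, "Dutch");          // stems only the non-keyword copy ("vans" -> "van")
stream = new RemoveDuplicatesTokenFilter(stream);      // drops the extra copy when stemming changed nothing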

On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden  
wrote:

Hello list,

We have a fairly large Lucene database for a 30+ million post forum. 
Users post and search for all kinds of things. To make sure users don't 
have to type exact matches, we combine a WordDelimiterFilter with a 
(Dutch) SnowballFilter.


Unfortunately users sometimes find examples of words that get stemmed to
a word that's basically a stop word. Or, conversely, a very common
word is stemmed so that it becomes the same as a rare word.

We do index stop words, so theoretically they could still find their
result. But when a rare word is stemmed in such a way that it yields a
million hits, the results become pretty much unusable...


One example is the Dutch word 'van' which is the equivalent of 'of' in 
English. A user tried to search for the shoe brand 'vans', which gets 
stemmed to 'van' and obviously gives useless results.


I already noticed the 'KeywordRepeatFilter' to index/search both 'vans' 
and 'van' and the StemmerOverrideFilter to try and prevent these cases. 
Are there any other solutions for these kinds of problems?


Best regards,

Arjen van der Meijden




Re: DocIDs from Facet Results

2014-07-07 Thread Jigar Shah
I think you need to execute a DrillDownQuery to get the docIds.
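
Something along these lines for 4.7 -- a sketch, where "config" is your
FacetsConfig, "searcher" your IndexSearcher, and the dimension/label names are
just the ones from your example:

import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

DrillDownQuery ddq = new DrillDownQuery(config, new MatchAllDocsQuery()); // or wrap your original query instead
ddq.add("Field1", "FVal1");                                               // drill down to one facet value
TopDocs docs = searcher.search(ddq, 100);
for (ScoreDoc sd : docs.scoreDocs) {
  System.out.println(sd.doc);                                             // docIds that match that facet value
}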


On Mon, Jul 7, 2014 at 4:40 PM, Sandeep Khanzode <
sandeep_khanz...@yahoo.com.invalid> wrote:

> Hi,
>
> For Lucene 4.7.2 facets, once we invoke the FacetsCollector and get the
> topNChildren into a FacetResult, is there any mechanism by which, for a
> particular search result, I could get the docIds corresponding to any facet?
>
> Say, I have a facet defined on Field1. Upon Search and FacetCollection, I
> get FVal1, FVal2, and FVal3 as top3Children along with their corresponding
> counts. Can I look into (a) Field1 and get all docIDs, or (b) FVal1 or
> FVal2 or FVal3 and get their corresponding docIds?
>
> ---
> Thanks n Regards,
> Sandeep Ramesh Khanzode


DocIDs from Facet Results

2014-07-07 Thread Sandeep Khanzode
Hi,

For Lucene 4.7.2 facets, once we invoke the FacetsCollector and get the topNChildren 
into a FacetResult, is there any mechanism by which, for a particular search result, 
I could get the docIds corresponding to any facet?

Say, I have a facet defined on Field1. Upon Search and FacetCollection, I get 
FVal1, FVal2, and FVal3 as top3Children along with their corresponding counts. 
Can I look into (a) Field1 and get all docIDs, or (b) FVal1 or FVal2 or FVal3 
and get their corresponding docIds?
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode