[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-09-26 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114484#comment-13114484
 ] 

Koji Sekiguchi commented on LUCENE-3234:


Ok, thank you, David. I opened SOLR-2794.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, 
 LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-09-25 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114157#comment-13114157
 ] 

David Smiley commented on LUCENE-3234:
--

Why is the default unlimited; shouldn't it be the suggested 5000?  I doubt it 
could be for backwards compatibility since I can't see how an app might depend 
on the unlimited behavior. I think Solr should have good defaults that protect 
against pathological cases. Other defaults in Lucene/Solr have such defaults, 
in general (e.g. hl.maxAnalyzedChars).

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, 
 LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-09-25 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114233#comment-13114233
 ] 

Koji Sekiguchi commented on LUCENE-3234:


It could be the suggested 5000. I don't have a persistence on it. The current 
default is just for back-compat.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, 
 LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-09-25 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114472#comment-13114472
 ] 

David Smiley commented on LUCENE-3234:
--

Koji; you didn't react to my comment that an app would be incredibly unlikely 
to actually depend on the unlimited behavior; thus there is no back-compat.  
Again, I think it should be 5000.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, 
 LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-26 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055301#comment-13055301
 ] 

Mike Sokolov commented on LUCENE-3234:
--

Thank you, Koji - it's nice to have my first patch committed!

um - one little comment; since you made the default be MAX_VALUE, there is a 
javadoc comment that should be updated which says it is 5000.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, 
 LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-26 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055309#comment-13055309
 ] 

Koji Sekiguchi commented on LUCENE-3234:


Thank you again for checking the commit, Mike! The javadoc has been fixed.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, 
 LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-25 Thread Digy (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054935#comment-13054935
 ] 

Digy commented on LUCENE-3234:
--

I am not sure how much it is related to this issue but there was
a similar issue in Lucene.Net.
https://issues.apache.org/jira/browse/LUCENENET-350



 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054326#comment-13054326
 ] 

Robert Muir commented on LUCENE-3234:
-

Oh I see, I think i'm nervous about testRepeatedTerms too.
Maybe we can comment it out and just mention its more of a benchmark?

The problem could be that the test is timing-based... in general a machine 
could suddenly get busy at any time,
especially since we run many tests in parallel, so I'm worried it could 
intermittently fail.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
Assignee: Koji Sekiguchi
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch, LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054007#comment-13054007
 ] 

Robert Muir commented on LUCENE-3234:
-

I like this tradeoff Mike, thanks!

should we consider setting some kind of absurd default like 10,000 to really 
prevent some pathological cases with huge documents?
We could document in CHANGES.txt that if you want the old behavior, set it to 
-1 or Integer.MAX_VALUE (I think we can use this here? offsets are ints?)

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054014#comment-13054014
 ] 

Robert Muir commented on LUCENE-3234:
-

yeah, you are right.. but seeing as how positions are ints too, I think it 
might be easier to do Integer.MAX_VALUE versus the -1 parameter.


 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054010#comment-13054010
 ] 

Mike Sokolov commented on LUCENE-3234:
--

Yes, although a smaller number might be fine.  Maybe Koji will comment: I don't 
completely understand the scaling here, but it seemed to me that I had a case 
with around 2000 occurrences of a term that lead to a 15-20 sec evaluation time 
on my desktop.  The max value will be an int, sire, although I think the number 
is going to scale like positions, not offsets FWIW.


 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054093#comment-13054093
 ] 

Mike Sokolov commented on LUCENE-3234:
--

Yes, that makes sense to me - default to 5000, say, and set explicitly to 
either MAX_VALUE or -1 to get the unlimited behavior (I prefer to allow -1 
since otherwise you should probably treat it as an error).  Do you want me to 
change the patch, or should I just leave that to the committer?

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054114#comment-13054114
 ] 

Robert Muir commented on LUCENE-3234:
-

You can change it if you don't mind. However, I think I agree it would be good 
to figure out if there is an n^2 here. This might have some affect on what the 
default value should be... ideally there is some way we could fix the n^2.

Is there a way to turn your test case into a benchmark, or do you have a 
separate benchmark (the example you mentioned where it blows up really bad). 
This could help in looking at what's going on.


 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054125#comment-13054125
 ] 

Mike Sokolov commented on LUCENE-3234:
--

I don't think I can share the test documents I have - they belong to someone 
else.  I can look at trying to make something bad happen with the wikipedia 
data, but I'm curious why a benchmark is preferable to a test case? 

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054133#comment-13054133
 ] 

Robert Muir commented on LUCENE-3234:
-

oh thats ok, i just meant a little tiny benchmark, hitting the nasty case that 
we might think might be n^2.
If the little test case does that... then that will work, just wasn't sure if 
it did.

either way just something to look at in the profiler, etc.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Mike Sokolov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054162#comment-13054162
 ] 

Mike Sokolov commented on LUCENE-3234:
--

I did go back and look at the original case that made me worried; in that case 
the bad document is 650K, and the matched term occurs 23000 times in it.  The 
search still finishes in 24 sec or so on my desktop, which isn't too bad I 
guess, considering.

After looking at that and measuring the change in the test case in the patch as 
the number of terms increase, I don't think there actually is an n^2 - just 
linear, but the growth is still enough that the patch has value. The test case 
in the patch is closely targeted at the method which takes all the time when 
you have large numbers of matching terms in a single document.

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mike Sokolov
 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter

2011-06-23 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054193#comment-13054193
 ] 

Koji Sekiguchi commented on LUCENE-3234:


Mike, thank you for your continuous interest to FVH! Can you add the parameter 
for Solr, with an appropriate default value if you would like. I don't know 
assertTrue test in testManyRepeatedTerms() is ok, for JENKINS?

 Provide limit on phrase analysis in FastVectorHighlighter
 -

 Key: LUCENE-3234
 URL: https://issues.apache.org/jira/browse/LUCENE-3234
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3
Reporter: Mike Sokolov
 Fix For: 3.4, 4.0

 Attachments: LUCENE-3234.patch


 With larger documents, FVH can spend a lot of time trying to find the 
 best-scoring snippet as it examines every possible phrase formed from 
 matching terms in the document.  If one is willing to accept
 less-than-perfect scoring by limiting the number of phrases that are 
 examined, substantial speedups are possible.  This is analogous to the 
 Highlighter limit on the number of characters to analyze.
 The patch includes an artifical test case that shows  1000x speedup.  In a 
 more normal test environment, with English documents and random queries, I am 
 seeing speedups of around 3-10x when setting phraseLimit=1, which has the 
 effect of selecting the first possible snippet in the document.  Most of our 
 sites operate in this way (just show the first snippet), so this would be a 
 big win for us.
 With phraseLimit = -1, you get the existing FVH behavior. At larger values of 
 phraseLimit, you may not get substantial speedup in the normal case, but you 
 do get the benefit of protection against blow-up in pathological cases.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org