[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114484#comment-13114484 ] Koji Sekiguchi commented on LUCENE-3234: Ok, thank you, David. I opened SOLR-2794. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114157#comment-13114157 ] David Smiley commented on LUCENE-3234: -- Why is the default unlimited; shouldn't it be the suggested 5000? I doubt it could be for backwards compatibility since I can't see how an app might depend on the unlimited behavior. I think Solr should have good defaults that protect against pathological cases. Other defaults in Lucene/Solr have such defaults, in general (e.g. hl.maxAnalyzedChars). Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114233#comment-13114233 ] Koji Sekiguchi commented on LUCENE-3234: It could be the suggested 5000. I don't have a persistence on it. The current default is just for back-compat. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114472#comment-13114472 ] David Smiley commented on LUCENE-3234: -- Koji; you didn't react to my comment that an app would be incredibly unlikely to actually depend on the unlimited behavior; thus there is no back-compat. Again, I think it should be 5000. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055301#comment-13055301 ] Mike Sokolov commented on LUCENE-3234: -- Thank you, Koji - it's nice to have my first patch committed! um - one little comment; since you made the default be MAX_VALUE, there is a javadoc comment that should be updated which says it is 5000. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055309#comment-13055309 ] Koji Sekiguchi commented on LUCENE-3234: Thank you again for checking the commit, Mike! The javadoc has been fixed. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054935#comment-13054935 ] Digy commented on LUCENE-3234: -- I am not sure how much it is related to this issue but there was a similar issue in Lucene.Net. https://issues.apache.org/jira/browse/LUCENENET-350 Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054326#comment-13054326 ] Robert Muir commented on LUCENE-3234: - Oh I see, I think i'm nervous about testRepeatedTerms too. Maybe we can comment it out and just mention its more of a benchmark? The problem could be that the test is timing-based... in general a machine could suddenly get busy at any time, especially since we run many tests in parallel, so I'm worried it could intermittently fail. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Assignee: Koji Sekiguchi Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch, LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054007#comment-13054007 ] Robert Muir commented on LUCENE-3234: - I like this tradeoff Mike, thanks! should we consider setting some kind of absurd default like 10,000 to really prevent some pathological cases with huge documents? We could document in CHANGES.txt that if you want the old behavior, set it to -1 or Integer.MAX_VALUE (I think we can use this here? offsets are ints?) Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054014#comment-13054014 ] Robert Muir commented on LUCENE-3234: - yeah, you are right.. but seeing as how positions are ints too, I think it might be easier to do Integer.MAX_VALUE versus the -1 parameter. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054010#comment-13054010 ] Mike Sokolov commented on LUCENE-3234: -- Yes, although a smaller number might be fine. Maybe Koji will comment: I don't completely understand the scaling here, but it seemed to me that I had a case with around 2000 occurrences of a term that lead to a 15-20 sec evaluation time on my desktop. The max value will be an int, sire, although I think the number is going to scale like positions, not offsets FWIW. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054093#comment-13054093 ] Mike Sokolov commented on LUCENE-3234: -- Yes, that makes sense to me - default to 5000, say, and set explicitly to either MAX_VALUE or -1 to get the unlimited behavior (I prefer to allow -1 since otherwise you should probably treat it as an error). Do you want me to change the patch, or should I just leave that to the committer? Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054114#comment-13054114 ] Robert Muir commented on LUCENE-3234: - You can change it if you don't mind. However, I think I agree it would be good to figure out if there is an n^2 here. This might have some affect on what the default value should be... ideally there is some way we could fix the n^2. Is there a way to turn your test case into a benchmark, or do you have a separate benchmark (the example you mentioned where it blows up really bad). This could help in looking at what's going on. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054125#comment-13054125 ] Mike Sokolov commented on LUCENE-3234: -- I don't think I can share the test documents I have - they belong to someone else. I can look at trying to make something bad happen with the wikipedia data, but I'm curious why a benchmark is preferable to a test case? Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054133#comment-13054133 ] Robert Muir commented on LUCENE-3234: - oh thats ok, i just meant a little tiny benchmark, hitting the nasty case that we might think might be n^2. If the little test case does that... then that will work, just wasn't sure if it did. either way just something to look at in the profiler, etc. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054162#comment-13054162 ] Mike Sokolov commented on LUCENE-3234: -- I did go back and look at the original case that made me worried; in that case the bad document is 650K, and the matched term occurs 23000 times in it. The search still finishes in 24 sec or so on my desktop, which isn't too bad I guess, considering. After looking at that and measuring the change in the test case in the patch as the number of terms increase, I don't think there actually is an n^2 - just linear, but the growth is still enough that the patch has value. The test case in the patch is closely targeted at the method which takes all the time when you have large numbers of matching terms in a single document. Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Reporter: Mike Sokolov Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3234) Provide limit on phrase analysis in FastVectorHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13054193#comment-13054193 ] Koji Sekiguchi commented on LUCENE-3234: Mike, thank you for your continuous interest to FVH! Can you add the parameter for Solr, with an appropriate default value if you would like. I don't know assertTrue test in testManyRepeatedTerms() is ok, for JENKINS? Provide limit on phrase analysis in FastVectorHighlighter - Key: LUCENE-3234 URL: https://issues.apache.org/jira/browse/LUCENE-3234 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9.4, 3.0.3, 3.1, 3.2, 3.3 Reporter: Mike Sokolov Fix For: 3.4, 4.0 Attachments: LUCENE-3234.patch With larger documents, FVH can spend a lot of time trying to find the best-scoring snippet as it examines every possible phrase formed from matching terms in the document. If one is willing to accept less-than-perfect scoring by limiting the number of phrases that are examined, substantial speedups are possible. This is analogous to the Highlighter limit on the number of characters to analyze. The patch includes an artifical test case that shows 1000x speedup. In a more normal test environment, with English documents and random queries, I am seeing speedups of around 3-10x when setting phraseLimit=1, which has the effect of selecting the first possible snippet in the document. Most of our sites operate in this way (just show the first snippet), so this would be a big win for us. With phraseLimit = -1, you get the existing FVH behavior. At larger values of phraseLimit, you may not get substantial speedup in the normal case, but you do get the benefit of protection against blow-up in pathological cases. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org