[jira] [Created] (SOLR-4253) Misleading resource loading warning from Carrot2 clustering component
Stanislaw Osinski created SOLR-4253:

Summary: Misleading resource loading warning from Carrot2 clustering component
Key: SOLR-4253
URL: https://issues.apache.org/jira/browse/SOLR-4253
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 4.0
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.1

{{SolrResourceLoader.openResource(String)}} now throws only {{IOException}}, which causes the clustering component to issue resource loading warnings even if the fallback resources from the Carrot2 JAR are available.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-4253) Misleading resource loading warning from Carrot2 clustering component
[ https://issues.apache.org/jira/browse/SOLR-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-4253.

Resolution: Fixed

Fixed in trunk and 4.x branch.

Misleading resource loading warning from Carrot2 clustering component

Key: SOLR-4253
URL: https://issues.apache.org/jira/browse/SOLR-4253
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 4.0
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.1

{{SolrResourceLoader.openResource(String)}} now throws only {{IOException}}, which causes the clustering component to issue resource loading warnings even if the fallback resources from the Carrot2 JAR are available.
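The behaviour described in SOLR-4253 can be illustrated with a minimal sketch. It is not the committed fix: {{ResourceOpener}} and {{FallbackResourceLookup}} are hypothetical names standing in for the Solr-side loader and the Carrot2 JAR lookup, and the only assumption is that both signal a missing resource with {{IOException}}. The point is that a failure of the first lookup alone should not produce a warning.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a loader that reports every failure as IOException.
interface ResourceOpener {
    String open(String name) throws IOException;
}

// Hypothetical helper: tries the primary location first, then the fallback.
class FallbackResourceLookup {
    // Collected warnings; a real component would send these to its logger.
    final List<String> warnings = new ArrayList<>();

    // Returns the resource content, or null when neither location has it.
    String load(String name, ResourceOpener primary, ResourceOpener fallback) {
        try {
            return primary.open(name);
        } catch (IOException primaryFailure) {
            // Not a problem yet: the fallback may still provide the resource.
        }
        try {
            return fallback.open(name);
        } catch (IOException fallbackFailure) {
            // Only now is the resource really unavailable, so warn once.
            warnings.add("Could not load resource " + name + ": "
                    + fallbackFailure.getMessage());
            return null;
        }
    }
}
```

With this shape, a resource that is missing from the Solr config directory but present in the JAR loads without any warning; a warning is recorded only when both lookups fail.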
[jira] [Resolved] (SOLR-3279) Upgrade Carrot2 to minimize the possibility of dependency clashes
[ https://issues.apache.org/jira/browse/SOLR-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-3279.

Resolution: Fixed

Carrot2 upgraded to 3.6.2 in trunk and 4.x branch.

NB: Carrot2 3.6.2 stock binaries ship with Guava r12, but r13 (currently in Solr) is backwards compatible. If at some point an upgrade to r14 is needed, it will most likely be possible without upgrading Carrot2.

Upgrade Carrot2 to minimize the possibility of dependency clashes

Key: SOLR-3279
URL: https://issues.apache.org/jira/browse/SOLR-3279
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.1

When we get closer to the 4.0 release, update Carrot2 to the then newest version so that the dependencies get a refresh too (re: http://lucene.472066.n3.nabble.com/Old-Google-Guava-library-needs-updating-r05-td3854433.html).
[jira] [Commented] (SOLR-3279) Upgrade Carrot2 to minimize the possibility of dependency clashes
[ https://issues.apache.org/jira/browse/SOLR-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541640#comment-13541640 ]

Stanislaw Osinski commented on SOLR-3279:

It's high time we upgraded, I'll take a look at this tomorrow.

Upgrade Carrot2 to minimize the possibility of dependency clashes

Key: SOLR-3279
URL: https://issues.apache.org/jira/browse/SOLR-3279
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.1

When we get closer to the 4.0 release, update Carrot2 to the then newest version so that the dependencies get a refresh too (re: http://lucene.472066.n3.nabble.com/Old-Google-Guava-library-needs-updating-r05-td3854433.html).
[jira] [Commented] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
[ https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280023#comment-13280023 ]

Stanislaw Osinski commented on SOLR-3470:

Not pretty indeed, but still better than hardcoding Carrot2 attribute names. I'll commit this in a moment.

Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

Key: SOLR-3470
URL: https://issues.apache.org/jira/browse/SOLR-3470
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.0, 3.6.1
Attachments: SOLR-3470.patch
[jira] [Resolved] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
[ https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-3470.

Resolution: Fixed

Dawid's patch committed to trunk and 3.6 branch.

Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

Key: SOLR-3470
URL: https://issues.apache.org/jira/browse/SOLR-3470
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.0, 3.6.1
Attachments: SOLR-3470.patch
[jira] [Created] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
Stanislaw Osinski created SOLR-3470:

Summary: Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
Key: SOLR-3470
URL: https://issues.apache.org/jira/browse/SOLR-3470
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.6.1
[jira] [Updated] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
[ https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-3470:

Fix Version/s: 4.0

Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

Key: SOLR-3470
URL: https://issues.apache.org/jira/browse/SOLR-3470
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.0, 3.6.1
[jira] [Resolved] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
[ https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-3470.

Resolution: Fixed

Fixed in trunk and 3.6.1 branch.

Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

Key: SOLR-3470
URL: https://issues.apache.org/jira/browse/SOLR-3470
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.0, 3.6.1
[jira] [Reopened] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
[ https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski reopened SOLR-3470:

Unit tests pass fine, but Carrot2's internal class resolution code (context class loader) doesn't play well with how Solr loads contrib classes in webapp mode. A brute-force fix would be to do the class loading the Solr way in the clustering component and pass class objects instead of strings to Carrot2.

Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

Key: SOLR-3470
URL: https://issues.apache.org/jira/browse/SOLR-3470
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 4.0, 3.6.1
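The "brute-force" idea in the reopening comment, resolving the class on the caller's side and handing over a class object rather than a name, might look roughly like the sketch below. {{ExplicitClassResolver}} is a hypothetical helper, not Solr or Carrot2 API; it uses only the standard {{Class.forName(String, boolean, ClassLoader)}} call with an explicitly supplied loader instead of relying on the thread's context class loader.

```java
// Hypothetical helper: resolves a class name with a loader chosen by the
// caller, so the result does not depend on the context class loader.
class ExplicitClassResolver {
    // Loads className through the given loader and checks that it is a
    // subtype of expectedType before returning the class object.
    static <T> Class<? extends T> resolve(String className,
                                          Class<T> expectedType,
                                          ClassLoader loader) {
        try {
            return Class.forName(className, true, loader).asSubclass(expectedType);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Cannot load " + className, e);
        }
    }
}
```

The resolved class object can then be passed on in place of the original string, which sidesteps any second lookup by name in a different class loader.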
[jira] [Created] (SOLR-2706) The carrot.lexicalResourcesDir parameter does not work with absolute directories
The carrot.lexicalResourcesDir parameter does not work with absolute directories

Key: SOLR-2706
URL: https://issues.apache.org/jira/browse/SOLR-2706
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.3, 3.2
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.4, 4.0
[jira] [Commented] (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078632#comment-13078632 ]

Stanislaw Osinski commented on SOLR-1692:

Looking at the code, the issue is resolved: summaries (from the highlighter) are used for clustering when configured. I see there's no unit test for the feature though, so I can write one and resolve the issue.

CarrotClusteringEngine produce summary does nothing

Key: SOLR-1692
URL: https://issues.apache.org/jira/browse/SOLR-1692
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Reporter: Grant Ingersoll
Fix For: 3.4, 4.0
Attachments: SOLR-1692.patch

In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored.
[jira] [Assigned] (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski reassigned SOLR-1692:

Assignee: Stanislaw Osinski

CarrotClusteringEngine produce summary does nothing

Key: SOLR-1692
URL: https://issues.apache.org/jira/browse/SOLR-1692
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Stanislaw Osinski
Fix For: 3.4, 4.0
Attachments: SOLR-1692.patch

In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored.
[jira] [Created] (SOLR-2692) Typo in clustering fragment size param name
Typo in clustering fragment size param name

Key: SOLR-2692
URL: https://issues.apache.org/jira/browse/SOLR-2692
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.3
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.1.1, 3.4

The param should be {{carrot.fragSize}} but it's {{carrot.fragzise}}.
[jira] [Updated] (SOLR-2692) Typo in clustering fragment size param name
[ https://issues.apache.org/jira/browse/SOLR-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2692:

Fix Version/s: (was: 3.1.1)

I mistook 3.1.1 for 3.3.1.

Typo in clustering fragment size param name

Key: SOLR-2692
URL: https://issues.apache.org/jira/browse/SOLR-2692
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.3
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.4

The param should be {{carrot.fragSize}} but it's {{carrot.fragzise}}.
[jira] [Resolved] (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-1692.

Resolution: Fixed
Fix Version/s: (was: 3.4) (was: 4.0) 3.1

This issue was really fixed for 3.1.0 and documented in CHANGES under that release. It doesn't make sense to complicate things further as I suggested in the discussion above, so resolving.

CarrotClusteringEngine produce summary does nothing

Key: SOLR-1692
URL: https://issues.apache.org/jira/browse/SOLR-1692
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Stanislaw Osinski
Fix For: 3.1
Attachments: SOLR-1692.patch

In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored.
[jira] [Resolved] (SOLR-2692) Typo in clustering fragment size param name
[ https://issues.apache.org/jira/browse/SOLR-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-2692.

Resolution: Fixed

Fixed in trunk and branch_3x; ClusteringComponent wiki updated to warn users of this bug.

Typo in clustering fragment size param name

Key: SOLR-2692
URL: https://issues.apache.org/jira/browse/SOLR-2692
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Affects Versions: 3.3
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.4

The param should be {{carrot.fragSize}} but it's {{carrot.fragzise}}.
[jira] [Created] (SOLR-2561) SimpleXML notice is a copy of mahout-math notice
SimpleXML notice is a copy of mahout-math notice

Key: SOLR-2561
URL: https://issues.apache.org/jira/browse/SOLR-2561
Project: Solr
Issue Type: Bug
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Trivial
Fix For: 3.3, 4.0

The note should probably say something like:

This product includes software developed by the SimpleXML project (http://simple.sourceforge.net/).
[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033919#comment-13033919 ]

Stanislaw Osinski commented on SOLR-2448:

Hi, if there are no objections, I'd like to commit this patch later today.

Thanks!

S.

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Resolved] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words
[ https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-2450.

Resolution: Fixed

Committed to trunk and branch_3x.

Carrot2 clustering should use both its own and Solr's stop words

Key: SOLR-2450
URL: https://issues.apache.org/jira/browse/SOLR-2450
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2450.patch

While using only Solr's stop words for clustering isn't a good idea (compared to indexing, clustering needs more aggressive stop word removal to get reasonable cluster labels), it would be good if Carrot2 used both its own and Solr's stop words.

I'm not sure what the best way to implement this would be though. My first thought was to simply load {{stopwords.txt}} from Solr config dir and merge them with Carrot2's. But then, maybe a better approach would be to get the stop words from the StopFilter being used? Ideally, we should also consider the per-field stop filters configured on the fields used for clustering.
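The "first thought" above, loading Solr's stop words and merging them with Carrot2's, amounts to a set union. The sketch below is only an illustration of that idea, not the committed patch: {{StopWordMerger}} is a hypothetical name, the inputs are plain collections of lines, and the normalization rules (trimming, lowercasing, skipping blank lines and {{#}} comments) are assumptions about the file format.

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: union of two stop word lists with light normalization.
class StopWordMerger {
    // Returns a sorted, de-duplicated set containing words from both sources.
    static Set<String> merge(Collection<String> carrot2Words,
                             Collection<String> solrWords) {
        Set<String> merged = new TreeSet<>();
        addNormalized(merged, carrot2Words);
        addNormalized(merged, solrWords);
        return merged;
    }

    private static void addNormalized(Set<String> target, Collection<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: blank lines and lines starting with '#' carry no words.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            target.add(trimmed.toLowerCase(Locale.ROOT));
        }
    }
}
```

Taking the words from the configured StopFilter instead, as the description also suggests, would change only where the second collection comes from; the merge step would stay the same.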
[jira] [Resolved] (SOLR-2449) Loading of Carrot2 resources from Solr config directory
[ https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-2449.

Resolution: Fixed

Committed to trunk and branch_3x.

Loading of Carrot2 resources from Solr config directory

Key: SOLR-2449
URL: https://issues.apache.org/jira/browse/SOLR-2449
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Fix For: 3.2, 4.0
Attachments: SOLR-2449.patch

Currently, Carrot2 clustering algorithms read linguistic resources (stop words, stop labels) from the classpath (Carrot2 JAR), which makes them difficult to edit/override. The directory from which Carrot2 should read its resources (absolute, or relative to Solr config dir) could be specified in the {{engine}} element. By default, the path could be e.g. {{solr.conf/clustering/carrot2}}.
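The "absolute, or relative to Solr config dir" rule from the description can be sketched with {{java.nio.file.Path}}, whose {{resolve}} method returns its argument unchanged when that argument is absolute. {{ResourceDirResolver}} is a hypothetical name, and the default of {{clustering/carrot2}} is taken from the example in the description rather than from the committed code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: decides which directory lexical resources are read from.
class ResourceDirResolver {
    // Assumed default, relative to the Solr config dir (from the issue text).
    static final String DEFAULT_DIR = "clustering/carrot2";

    // configured may be null/blank (use the default), relative (resolved
    // against solrConfigDir) or absolute (used as-is).
    static Path resolve(Path solrConfigDir, String configured) {
        String value = (configured == null || configured.trim().isEmpty())
                ? DEFAULT_DIR
                : configured.trim();
        // Path.resolve keeps an absolute argument and appends a relative one.
        return solrConfigDir.resolve(Paths.get(value)).normalize();
    }
}
```

Handling both cases through one {{resolve}} call avoids treating an absolute directory as if it were relative, which is the kind of problem SOLR-2706 reports for {{carrot.lexicalResourcesDir}}.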
[jira] [Resolved] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-2448.

Resolution: Fixed

Committed to trunk and branch_3x.

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Resolved] (SOLR-2505) Output cluster scores
[ https://issues.apache.org/jira/browse/SOLR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski resolved SOLR-2505.

Resolution: Fixed

Committed to trunk and branch_3x.

Output cluster scores

Key: SOLR-2505
URL: https://issues.apache.org/jira/browse/SOLR-2505
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0

Carrot2 algorithms compute cluster scores; we could expose them in the output of the Solr clustering component. Along with scores, we can output a boolean flag that marks the Other Topics groups.
[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2448:

Attachment: (was: SOLR-2448-2449-2450-2505-trunk.zip)

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2448:

Attachment: carrot2-core-3.5.0.jar
            SOLR-2448-2449-2450-2505-trunk.patch
            SOLR-2448-2449-2450-2505-branch_3x.patch

Hi, here's another set of patches (svn this time) against trunk and branch_3x. I've corrected Maven configs and checked that the project builds fine using mvn install. After applying the patches you'd need to manually update the JARs:

In trunk, delete:

  trunk/solr/contrib/clustering/lib/carrot2-core-3.4.2.jar
  trunk/solr/contrib/clustering/lib/hppc-0.3.1.jar

and replace them with new versions:

  http://repo1.maven.org/maven2/org/carrot2/carrot2-core/3.5.0/carrot2-core-3.5.0.jar
  http://repo1.maven.org/maven2/com/carrotsearch/hppc/0.3.3/hppc-0.3.3.jar

In branch_3x, delete:

  branch_3x/solr/contrib/clustering/lib/carrot2-core-3.4.2.jar
  branch_3x/solr/contrib/clustering/lib/hppc-0.3.1.jar

and replace them with new versions:

  carrot2-core-3.5.0.jar attached (jdk15 backport)
  http://repo1.maven.org/maven2/com/carrotsearch/hppc/0.3.4/hppc-0.3.4-jdk15.jar

It'd be great if someone could review these before I make the commit.

Thanks!

S.

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2448:

Attachment: SOLR-2448-2449-2450-2505-trunk.zip

Hi, we've finally [released Carrot2 3.5.0|http://project.carrot2.org/release-3.5.0], so I'm attaching the patch (git) against Solr trunk for your review. The patch contains several separate commits related to the upgrade (SOLR-2448, SOLR-2449, SOLR-2450, SOLR-2505); I hope it will be easier to review this way.

One thing I'm wondering about is Maven artifact generation, which seems to be gone from trunk contribs (compared to the 3.x branch). Let me know if I need to update the dependencies/version numbers anywhere.

The patch for Solr 3.x is in the works; we need to release a JDK 1.5-compatible version of some of the dependencies (HPPC) to make it happen.

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-trunk.zip, SOLR-2448.zip

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2448:

Attachment: (was: SOLR-2448.zip)

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-trunk.zip

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2448:

Attachment: SOLR-2448-2449-2450-2505-svn.patch

Hi Steven,

Thanks for your help and apologies for the git confusion, here's the SVN patch. After patching, you'd also need to delete:

  trunk/solr/contrib/clustering/lib/carrot2-core-3.4.2.jar
  trunk/solr/contrib/clustering/lib/hppc-0.3.1.jar

and replace them with new versions:

  http://repo1.maven.org/maven2/org/carrot2/carrot2-core/3.5.0/carrot2-core-3.5.0.jar
  http://repo1.maven.org/maven2/com/carrotsearch/hppc/0.3.3/hppc-0.3.3.jar

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-svn.patch, SOLR-2448-2449-2450-2505-trunk.zip

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031186#comment-13031186 ]

Stanislaw Osinski commented on SOLR-2448:

bq. So, I first tried running ant generate-maven-artifacts from solr/ on trunk, without applying your patches, and all artifacts, including contribs, are generated under solr/package/maven/. Are you using a different Ant target for Maven artifact generation?

The target runs fine for me too (on the patched code). I just wanted to update the version number of the Carrot2 dependency, but couldn't find any file referencing the old number (3.4.2). Now I see that the generated solr-clustering POM XML has carrot2-core as a dependency, but does not specify the exact version number. I guess there's some more Maven magic I need to learn to understand this :-)

Upgrade Carrot2 to version 3.5.0

Key: SOLR-2448
URL: https://issues.apache.org/jira/browse/SOLR-2448
Project: Solr
Issue Type: Task
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
Fix For: 3.2, 4.0
Attachments: SOLR-2448-2449-2450-2505-svn.patch, SOLR-2448-2449-2450-2505-trunk.zip

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031197#comment-13031197 ]

Stanislaw Osinski commented on SOLR-2448:

bq. Versions for all dependencies for both Solr and Lucene are specified in one place: the grandparent POM, in the root of the sources.

Everything is clear then, thanks! I'll update the version number and remove the Carrot2 Maven repository, since the latest Carrot2 binaries are now available from Maven Central.

Upgrade Carrot2 to version 3.5.0
Key: SOLR-2448 URL: https://issues.apache.org/jira/browse/SOLR-2448 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2448-2449-2450-2505-svn.patch, SOLR-2448-2449-2450-2505-trunk.zip

Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these.
[jira] [Created] (SOLR-2505) Output cluster scores
Output cluster scores - Key: SOLR-2505 URL: https://issues.apache.org/jira/browse/SOLR-2505 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Carrot2 algorithms compute cluster scores; we could expose them on the output from Solr clustering component. Along with scores, we can output a boolean flag that marks the Other Topics groups. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
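To make the proposal above concrete, here is a minimal sketch of what a cluster entry carrying a score and an Other Topics flag could look like. It is not the component's actual output format: the `ScoredCluster` class, the `render_scored_cluster` function and the key names (`labels`, `score`, `docs`, `other-topics`) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ScoredCluster:
    """Hypothetical stand-in for a cluster returned by a clustering algorithm."""
    labels: List[str]
    doc_ids: List[str]
    score: float
    other_topics: bool = False  # True only for the catch-all "Other Topics" group


def render_scored_cluster(cluster: ScoredCluster) -> Dict[str, Any]:
    """Render one cluster as a plain dict that includes its score."""
    entry: Dict[str, Any] = {
        "labels": list(cluster.labels),
        "score": cluster.score,
        "docs": list(cluster.doc_ids),
    }
    # Emit the flag only for the catch-all group, so regular clusters stay compact.
    if cluster.other_topics:
        entry["other-topics"] = True
    return entry
```

Emitting the flag only when it is true is one possible design choice; always emitting an explicit boolean would work just as well for clients.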
[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-2448: Attachment: SOLR-2448.zip Initial patch (git) based on Carrot2 3.5.0-dev, against Solr trunk. As soon as we make the stable 3.5.0 release, I'll submit the final patch for your review. Upgrade Carrot2 to version 3.5.0 Key: SOLR-2448 URL: https://issues.apache.org/jira/browse/SOLR-2448 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2448.zip Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2449) Loading of Carrot2 resources from Solr config directory
[ https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-2449: Attachment: SOLR-2449.patch The patch requires the SOLR-2448 patch applied. Loading of Carrot2 resources from Solr config directory --- Key: SOLR-2449 URL: https://issues.apache.org/jira/browse/SOLR-2449 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Fix For: 3.2, 4.0 Attachments: SOLR-2449.patch Currently, Carrot2 clustering algorithms read linguistic resources (stop words, stop labels) from the classpath (Carrot2 JAR), which makes them difficult to edit/override. The directory from which Carrot2 should read its resources (absolute, or relative to Solr config dir) could be specified in the {{engine}} element. By default, the path could be e.g. {{solr.conf/clustering/carrot2}}. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
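The lookup described in this issue (a directory that is either absolute or relative to the Solr config dir, with the resources bundled in the Carrot2 JAR as a fallback) can be sketched roughly as follows. This is only an illustration of the path resolution, not the patch itself: the function names are hypothetical, and the default `clustering/carrot2` is taken from the example path in the description.

```python
from pathlib import Path
from typing import Optional

# Assumed default, relative to the Solr config dir (from the issue description).
DEFAULT_RESOURCES_DIR = "clustering/carrot2"


def resolve_resources_dir(config_dir: str, configured: Optional[str] = None) -> Path:
    """Return the resources directory: absolute paths are used as-is,
    relative ones are resolved against the Solr config dir."""
    path = Path(configured or DEFAULT_RESOURCES_DIR)
    return path if path.is_absolute() else Path(config_dir) / path


def locate_resource(resources_dir: Path, name: str) -> Optional[Path]:
    """Return the overriding file if present, or None to signal that the
    caller should fall back to the copy bundled with the library."""
    candidate = resources_dir / name
    return candidate if candidate.is_file() else None
```

Returning `None` rather than raising keeps the "not overridden" case a normal outcome, so falling back to the built-in resource need not produce a warning.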
[jira] [Updated] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words
[ https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-2450: Attachment: SOLR-2450.patch Patch for the use of stop words from the field's {{StopWordFilterFactory}} and {{CommonGramsFilterFactory}} in addition to Carrot2's built-in stop words. Requires the SOLR-2448 and SOLR-2449 patches applied. Carrot2 clustering should use both its own and Solr's stop words Key: SOLR-2450 URL: https://issues.apache.org/jira/browse/SOLR-2450 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Attachments: SOLR-2450.patch While using only Solr's stop words for clustering isn't a good idea (compared to indexing, clustering needs more aggressive stop word removal to get reasonable cluster labels), it would be good if Carrot2 used both its own and Solr's stop words. I'm not sure what the best way to implement this would be though. My first thought was to simply load {{stopwords.txt}} from Solr config dir and merge them with Carrot2's. But then, maybe a better approach would be to get the stop words from the StopFilter being used? Ideally, we should also consider the per-field stop filters configured on the fields used for clustering. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
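As an illustration of the merging idea, the sketch below combines several stop word lists into one set. It assumes a simple file format (one word per line, lines starting with `#` treated as comments) and case-insensitive matching; both are assumptions for the example, and the function names are hypothetical rather than part of the patch.

```python
from typing import Iterable, Set


def parse_stop_words(lines: Iterable[str]) -> Set[str]:
    """Parse one stop word list: one word per line, '#' lines are comments."""
    words: Set[str] = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def merge_stop_words(*sources: Iterable[str]) -> Set[str]:
    """Union of all stop word lists, e.g. Solr's per-field lists plus Carrot2's own."""
    merged: Set[str] = set()
    for source in sources:
        merged |= parse_stop_words(source)
    return merged
```

For example, `merge_stop_words(solr_lines, carrot2_lines)` yields a single set in which a word listed by either source is treated as a stop word during clustering.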
[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013547#comment-13013547 ] Stanislaw Osinski commented on SOLR-2448: - Oh, is there any way to assign this issue to myself? It looks like I don't have this permission now. Upgrade Carrot2 to version 3.5.0 Key: SOLR-2448 URL: https://issues.apache.org/jira/browse/SOLR-2448 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2449) Loading of Carrot2 resources from Solr config directory
Loading of Carrot2 resources from Solr config directory --- Key: SOLR-2449 URL: https://issues.apache.org/jira/browse/SOLR-2449 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 3.2, 4.0 Currently, Carrot2 clustering algorithms read linguistic resources (stop words, stop labels) from the classpath (Carrot2 JAR), which makes them difficult to edit/override. The directory from which Carrot2 should read its resources (absolute, or relative to Solr config dir) could be specified in the {{engine}} element. By default, the path could be e.g. {{solr.conf/clustering/carrot2}}. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words
Carrot2 clustering should use both its own and Solr's stop words Key: SOLR-2450 URL: https://issues.apache.org/jira/browse/SOLR-2450 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 While using only Solr's stop words for clustering isn't a good idea (compared to indexing, clustering needs more aggressive stop word removal to get reasonable cluster labels), it would be good if Carrot2 used both its own and Solr's stop words. I'm not sure what the best way to implement this would be though. My first thought was to simply load {{stopwords.txt}} from Solr config dir and merge them with Carrot2's. But then, maybe a better approach would be to get the stop words from the StopFilter being used? Ideally, we should also consider the per-field stop filters configured on the fields used for clustering. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words
[ https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski reassigned SOLR-2450: --- Assignee: Stanislaw Osinski Carrot2 clustering should use both its own and Solr's stop words Key: SOLR-2450 URL: https://issues.apache.org/jira/browse/SOLR-2450 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 While using only Solr's stop words for clustering isn't a good idea (compared to indexing, clustering needs more aggressive stop word removal to get reasonable cluster labels), it would be good if Carrot2 used both its own and Solr's stop words. I'm not sure what the best way to implement this would be though. My first thought was to simply load {{stopwords.txt}} from Solr config dir and merge them with Carrot2's. But then, maybe a better approach would be to get the stop words from the StopFilter being used? Ideally, we should also consider the per-field stop filters configured on the fields used for clustering. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0
[ https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-2448: Assignee: Stanislaw Osinski Yes, thanks! Upgrade Carrot2 to version 3.5.0 Key: SOLR-2448 URL: https://issues.apache.org/jira/browse/SOLR-2448 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Priority: Minor Fix For: 3.2, 4.0 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it will be possible to implement a few improvements to the clustering plugin; I'll file separate issues for these. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-2449) Loading of Carrot2 resources from Solr config directory
[ https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013614#comment-13013614 ] Stanislaw Osinski commented on SOLR-2449: - This is exactly how I implemented it. I'll attach a patch for review when we release and integrate Carrot2 3.5.0 (required for this improvement to work). A more interesting case though is SOLR-2450 -- any hints about the recommended way to get hold of Solr's own stop words? Loading of Carrot2 resources from Solr config directory --- Key: SOLR-2449 URL: https://issues.apache.org/jira/browse/SOLR-2449 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Stanislaw Osinski Fix For: 3.2, 4.0 Currently, Carrot2 clustering algorithms read linguistic resources (stop words, stop labels) from the classpath (Carrot2 JAR), which makes them difficult to edit/override. The directory from which Carrot2 should read its resources (absolute, or relative to Solr config dir) could be specified in the {{engine}} element. By default, the path could be e.g. {{solr.conf/clustering/carrot2}}. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-2282: Attachment: SOLR-2282-concurrency-branch_3x.patch SOLR-2282-concurrency-trunk.patch Thanks for debugging this, Dawid! I think solution 2) you suggested would be the best because it applies both to version 3.4.2 of Carrot2 (currently used by Solr) and the 3.5.0 version (not yet released). I'm attaching patches for Solr trunk and branch_3x that fix the concurrency issue and correct a typo in a log message output by {{LuceneLanguageModelFactory}}. Distributed Support for Search Result Clustering Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282-concurrency-branch_3x.patch, SOLR-2282-concurrency-trunk.patch, SOLR-2282-diagnostics.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981168#action_12981168 ] Stanislaw Osinski commented on SOLR-2282: - Hi Robert, What's the configuration (OS / JVM) on which the test is failing for you? I can't get it to fail on my machines (Win 7 64-bit with Sun JVM 1.6.0_20 and Oracle 1.6.0_23, Ubuntu 64-bit with Sun JVM 1.6.0_20). I'm running the test using the command I found in Hudson logs (ant test -Dtestcase=DistributedClusteringComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=41204997274180:6405396687385598457 -Dtests.multiplier=3). S. Distributed Support for Search Result Clustering Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981168#action_12981168 ] Stanislaw Osinski edited comment on SOLR-2282 at 1/13/11 3:19 AM: -- Hi Robert, What's the configuration (OS / JVM) on which the test is failing for you? I can't get it to fail on my machines (Win 7 64-bit with Sun JVM 1.6.0_20 Client VM and Oracle 1.6.0_23 Server VM, Ubuntu 64-bit with Sun JVM 1.6.0_20 Server VM). I'm running the test using the command I found in Hudson logs (ant test -Dtestcase=DistributedClusteringComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=41204997274180:6405396687385598457 -Dtests.multiplier=3). S. was (Author: stanislaw.osinski): Hi Robert, What's the configuration (OS / JVM) on which the test is failing for you? I can't get it to fail on my machines (Win 7 64-bit with Sun JVM 1.6.0_20 and Oracle 1.6.0_23, Ubuntu 64-bit with Sun JVM 1.6.0_20). I'm running the test using the command I found in Hudson logs (ant test -Dtestcase=DistributedClusteringComponentTest -Dtestmethod=testDistribSearch -Dtests.seed=41204997274180:6405396687385598457 -Dtests.multiplier=3). S. Distributed Support for Search Result Clustering Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2282:
Attachment: SOLR-2282-diagnostics.patch

Robert: I was using the random seed from the build result in the hope that it would fail the test for me. I'm still unable to get the exception though, with or without the seed. I suppose it shouldn't matter whether I run the complete test suite or just this one test method? (I was doing the latter to save time.)

If you have a spare moment, would you be able to check the following two things on your machine:

1. Apply the attached diagnostics patch and run the tests. If the test doesn't fail after the change, this means there's some concurrency issue in Carrot2's internal resource pooling mechanisms that we'll need to find. This patch is not a solution to the problem though, just a diagnostic measure.

2. It's paranoid, but can you run the test with the {{-Dargs=-XX:+TraceClassLoading}} option and check that there's no old (v3.4.0) Carrot2 JAR hiding in the bushes? Version 3.4.0 had a subtle bug that could be causing the exception. If there are no traces of the Carrot2 3.4.0 JAR in the classpath, we'll need to do further inspection of our code.

Distributed Support for Search Result Clustering
Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282-diagnostics.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch

Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it.
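The second check above (looking for a stale Carrot2 JAR in the class-loading trace) could be automated with a small script along these lines. This is a sketch under assumptions: it only relies on the JAR file name (`carrot2-core-<version>...jar`) appearing somewhere in each trace line, and the exact line layout used in the example (`[Loaded <class> from <location>]`) is assumed rather than guaranteed for every JVM. The function name is hypothetical.

```python
import re
from typing import Iterable, Set

# Matches e.g. "carrot2-core-3.4.0.jar" or "carrot2-core-3.4.2-jdk1.5.jar".
CARROT2_JAR = re.compile(r"carrot2-core-(\d+(?:\.\d+)+)[^/\\\s]*\.jar")


def carrot2_versions(trace_lines: Iterable[str]) -> Set[str]:
    """Collect the distinct Carrot2 core versions mentioned in a class-loading trace."""
    versions: Set[str] = set()
    for line in trace_lines:
        match = CARROT2_JAR.search(line)
        if match:
            versions.add(match.group(1))
    return versions
```

If the resulting set contains more than one version, or contains 3.4.0 at all, an old JAR is still being picked up from somewhere on the classpath.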
[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981241#action_12981241 ]

Stanislaw Osinski commented on SOLR-2282:

{quote} well, its not completely consistent even with the seed to me (smells like a concurrency issue). {quote}

This is what I've been suspecting from the beginning; I hope Dawid has better luck reproducing the problem on his 4-core HT machine.

{quote} Silly question, but did you remove the @Ignore on DistributedClusteringComponentTest? Otherwise, the reproducibility problem could be that it doesn't consistently fail every time, even with the same seed. {quote}

Yeah, I did remove the @Ignore. I'm getting Testsuite: org.apache.solr.handler.clustering.DistributedClusteringComponentTest, Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 59,658 sec in the test results dir. When it comes to reproducibility, I wasn't able to reproduce some other concurrency issue on my 2-core machine, while on Dawid's 4-core hardware the tests would fail sometimes, so I hope we can eventually get the exception locally.

{quote} I ran my previous fail three times, with the patch. This failed two out of three times. {quote}

Thanks for verifying this! It looks like the bug may be at some other place in the C2 code than I initially thought. Let us review the code once again; as soon as we come up with a fix, I'll attach a patch.

Distributed Support for Search Result Clustering
Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282-diagnostics.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch

Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it.
[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980881#action_12980881 ] Stanislaw Osinski commented on SOLR-2282: - Sure, I'll take a look at it tomorrow morning. Distributed Support for Search Result Clustering Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2296) Upgrade Carrot2 binaries to version 3.4.2
[ https://issues.apache.org/jira/browse/SOLR-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-2296:
Attachment: carrot2-core-3.4.2-jdk1.5.jar

Carrot2 3.4.2 core JAR compiled for JDK 1.5. With it, contrib/clustering compiles fine for me and the clustering tests pass too.

Upgrade Carrot2 binaries to version 3.4.2
Key: SOLR-2296 URL: https://issues.apache.org/jira/browse/SOLR-2296 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Assignee: Koji Sekiguchi Fix For: 3.1, 4.0 Attachments: carrot2-core-3.4.2-jdk1.5.jar, carrot2-core-3.4.2.jar, SOLR-2296-branch_3.1.patch, SOLR-2296-trunk.patch

Version 3.4.2 fixes a concurrency bug in Carrot2 that may be causing SOLR-2282. I'll attach patches in a minute.
[jira] Created: (SOLR-2296) Upgrade Carrot2 binaries to version 3.4.2
Upgrade Carrot2 binaries to version 3.4.2 - Key: SOLR-2296 URL: https://issues.apache.org/jira/browse/SOLR-2296 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 3.1, 4.0 Version 3.4.2 fixes a concurrency bug in Carrot2 that may be causing SOLR-2282. I'll attach patches in a minute. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-2296) Upgrade Carrot2 binaries to version 3.4.2
[ https://issues.apache.org/jira/browse/SOLR-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-2296: Attachment: carrot2-core-3.4.2.jar SOLR-2296-trunk.patch SOLR-2296-branch_3.1.patch Patches for trunk, branch_3.1 and Carrot2 3.4.2 JAR (BSD License). Upgrade Carrot2 binaries to version 3.4.2 - Key: SOLR-2296 URL: https://issues.apache.org/jira/browse/SOLR-2296 Project: Solr Issue Type: Task Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 3.1, 4.0 Attachments: carrot2-core-3.4.2.jar, SOLR-2296-branch_3.1.patch, SOLR-2296-trunk.patch Version 3.4.2 fixes a concurrency bug in Carrot2 that may be causing SOLR-2282. I'll attach patches in a minute. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering
[ https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974314#action_12974314 ] Stanislaw Osinski commented on SOLR-2282: - This may be related to a concurrency bug we fixed in the latest (3.4.2) release of Carrot2. Tomorrow morning I can prepare a Carrot2 upgrade patch, which should hopefully fix the problem. Distributed Support for Search Result Clustering Key: SOLR-2282 URL: https://issues.apache.org/jira/browse/SOLR-2282 Project: Solr Issue Type: New Feature Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Koji Sekiguchi Assignee: Koji Sekiguchi Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch Brad Giaccio contributed a patch for this in SOLR-769. I'd like to incorporate it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970760#action_12970760 ]

Stanislaw Osinski commented on SOLR-769:

Hi Koji,

Actually, the current code seems right: if we don't output subclusters, we need to include all documents of the cluster, including those from its subclusters, otherwise the subclusters' documents may not appear in the response at all. But if we do output subclusters, we add only the documents assigned specifically to the cluster, because the subclusters with their documents will be included in the response too.

S.

Support Document and Search Result clustering
Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch

Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.

The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results.

While not fully fleshed out yet, the collection based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.
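The rule explained in the comment on this issue (flatten subcluster documents into the parent when subclusters are not output, otherwise list only the directly assigned documents) can be sketched as below. This is not the component's code: the `Cluster` class, the `render_cluster` function and the response keys are hypothetical names used only to illustrate the logic.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Cluster:
    """Hypothetical cluster: documents assigned directly, plus nested subclusters."""
    label: str
    docs: List[str] = field(default_factory=list)
    subclusters: List["Cluster"] = field(default_factory=list)


def all_docs(cluster: Cluster) -> List[str]:
    """Documents of the cluster and, recursively, of its subclusters, without duplicates."""
    collected: List[str] = []
    for doc in cluster.docs:
        if doc not in collected:
            collected.append(doc)
    for sub in cluster.subclusters:
        for doc in all_docs(sub):
            if doc not in collected:
                collected.append(doc)
    return collected


def render_cluster(cluster: Cluster, output_subclusters: bool) -> Dict[str, Any]:
    """Render a cluster for the response, following the rule described above."""
    if output_subclusters:
        # Subclusters carry their own documents, so list only the direct ones here.
        return {
            "label": cluster.label,
            "docs": list(cluster.docs),
            "clusters": [render_cluster(sub, True) for sub in cluster.subclusters],
        }
    # No subclusters in the output: flatten so that no document is lost.
    return {"label": cluster.label, "docs": all_docs(cluster)}
```

Either way, every document assigned anywhere in the cluster hierarchy appears exactly once under the top-level cluster's part of the response.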
[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1804: Attachment: carrot2-core-3.4.0-jdk1.5.jar Hi Grant, Thanks for committing the patches! I noticed that the 3.x branch build failed because Carrot2 JAR had classes in Java 1.6 format. I'm attaching a Java 1.5-compliant JAR. After replacing the original JAR with the attached one, all Solr tests passed on Java 1.5 on my machine. Apologies for not checking this earlier. Also, I believe the last paragraph of contrib/clustering/README.txt does not hold any more as all JARs are now distributed with Solr. Staszek Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 3.1, 4.0 Attachments: carrot2-core-3.4.0-jdk1.5.jar, SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch, SOLR-1804-carrot2-3.4.0-libs.zip, SOLR-1804.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
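A build failure like the one described above (a JAR containing classes in Java 1.6 format on a Java 1.5 branch) can be caught before committing by inspecting the class file headers. The sketch below reads the major version from each `.class` entry in a JAR; in the class file format, major version 49 corresponds to Java 1.5 and 50 to Java 1.6. The function names are hypothetical and the script is only an illustration, not part of the Solr build.

```python
import struct
import zipfile
from typing import Dict

JAVA_5_MAJOR = 49  # class file major version for Java 1.5; Java 1.6 uses 50


def class_major_version(class_bytes: bytes) -> int:
    """Read the major version from the 8-byte class file header."""
    magic, _minor, major = struct.unpack(">IHH", class_bytes[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def classes_newer_than(jar, max_major: int = JAVA_5_MAJOR) -> Dict[str, int]:
    """Map each class in the JAR whose format is newer than max_major to its major version.

    `jar` may be a file path or a binary file-like object.
    """
    too_new: Dict[str, int] = {}
    with zipfile.ZipFile(jar) as archive:
        for name in archive.namelist():
            if name.endswith(".class"):
                major = class_major_version(archive.read(name))
                if major > max_major:
                    too_new[name] = major
    return too_new
```

An empty result from `classes_newer_than("carrot2-core-3.4.0-jdk1.5.jar")` would indicate that every class in that JAR can be loaded by a Java 1.5 runtime, at least as far as the class file version is concerned.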
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901791#action_12901791 ] Stanislaw Osinski commented on SOLR-1804: - One more thing: contrib/clustering in trunk seems to contain some leftovers from the time clustering was disabled: build.xml.disabled, DISABLED-README.txt and the LGPL-related paragraph in README.txt. I guess we could remove them too to avoid confusion. S. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 3.1, 4.0 Attachments: carrot2-core-3.4.0-jdk1.5.jar, SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch, SOLR-1804-carrot2-3.4.0-libs.zip, SOLR-1804.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1804: Attachment: (was: SOLR-1804-carrot2-3.4.0-dev-libs.zip) Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch, SOLR-1804.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-1804:
Attachment: SOLR-1804-carrot2-3.4.0-libs.zip

Here are the libs with Carrot2 3.4.0 JAR.

1. Apply the patch (the patch hasn't changed)
2. Copy the libs from the ZIP overwriting the old ones
3. Remove Google collections from solr/lib (it's replaced by Guava from the ZIP). If you don't do that, tests will fail due to class path conflicts.

I've just tested this on my machine with the latest branch_3x (r966551) and all tests pass. If some tests fail for you, let me know and I'll investigate.

S.

Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch, SOLR-1804-carrot2-3.4.0-libs.zip, SOLR-1804.patch

http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary!
[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1804: Attachment: SOLR-1804-carrot2-3.4.0-dev-trunk.patch A patch against solr trunk, the libs are the same as for the branch_3x patch. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1804: Attachment: SOLR-1804-carrot2-3.4.0-dev.patch Ok, here's another shot. This time, the language model factory includes support for Chinese. To avoid compilation issues, the classes are loaded through reflection. Not pretty, but works. If there's a way to have access to smart chinese at compilation time, let me know, I can remove the reflection stuff, so that the refactoring is more reliable. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-1804:
Attachment: SOLR-1804-carrot2-3.4.0-dev.patch

Hi,

As we're near the 3.4.0 release of Carrot2, I'm including a patch that upgrades the clustering plugin. The most notable changes are:

* [3.4.0] Carrot2 core no longer depends on Lucene APIs, so the {{build.xml}} can be enabled again. The only class that makes use of the Lucene API, {{LuceneLanguageModelFactory}}, is now included in the plugin's code, so there shouldn't be any problems with refactoring. In fact, I've already updated {{LuceneLanguageModelFactory}} to remove the use of deprecated APIs.
* [3.3.0] The STC algorithm has seen some [significant scalability improvements|http://project.carrot2.org/release-3.3.0-notes.html]
* [3.2.0] Carrot2 core no longer depends on LGPL libraries, so all the JARs can now be included in Solr SVN and SOLR-2007 won't need fixing.

Included is a patch against r966211. A ZIP with JARs will follow in a sec.

A couple of notes:

* The upgrade requires upgrading Google Collections to Guava. This is a drop-in replacement, all tests pass for me after the upgrade, plus the upgrade is [recommended|http://code.google.com/p/google-collections/] on the original Google Collections site.
* The patch includes the Carrot2 3.4.0-dev JAR, but I guess it's worth committing already to avoid the library download hassle (SOLR-2007).
* Originally, Carrot2 supports clustering of Chinese content based on the Smart Chinese Tokenizer. This tokenizer would have to be referenced from the {{LuceneLanguageModelFactory}} class in Solr. However, when compiling the code in Ant, smartcn doesn't seem to be available in the classpath. Is it a matter of modifying the build files, or is it a policy on dependencies between plugins?

Let me know if you have any problems applying the patch.

Thanks!

S.
Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1804: Attachment: SOLR-1804-carrot2-3.4.0-dev-libs.zip Libs required for the Carrot2 3.4.0 update. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, SOLR-1804-carrot2-3.4.0-dev.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890757#action_12890757 ]

Stanislaw Osinski commented on SOLR-1804:

{quote}
Hi Stanislaw: this looks cool! So, carrot2 jars don't depend directly on Lucene, and we can re-enable this component in trunk, and simply maintain the LuceneLanguageModelFactory?
{quote}

Correct. The only dependency on Lucene is {{LuceneLanguageModelFactory}}, which is now part of the Solr code base. In fact, we could also try bringing back the clustering plugin to Solr trunk, though I haven't tried that yet.

{quote}
As far as the smart chinese, its currently not included with Solr, so I think this is why you have trouble. But could we enable a carrot2 factory for it that reflects it, in case the user puts the jar in the classpath?
{quote}

Essentially, the dependency on the smart chinese is optional in the sense that the lack of it will degrade the quality of clustering in Chinese, but will not break it. Let me see if I can make it optionally loadable in {{LuceneLanguageModelFactory}} too. If not, we'll have to live with degraded clustering quality in the case of Chinese.

Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, SOLR-1804-carrot2-3.4.0-dev.patch

http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary!
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890848#action_12890848 ] Stanislaw Osinski commented on SOLR-1804: - {quote} Essentially, the dependency on the smart chinese is optional in a sense that the lack of it will degrade the quality of clustering in Chinese, but will not break it. Let me see if I can make it optionally loadable in LuceneLanguageModelFactory too. {quote} I think we could handle this in a similar way as in Carrot2: attempt to load chinese tokenizer and fall back to the default one in case of class loading exceptions. The easiest implementation route would be to include smart chinese as a dependency during compilation of the clustering plugin with an understanding that the library may or may not be available during runtime. Is that possible with the current Solr compilation scripts? Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, SOLR-1804-carrot2-3.4.0-dev.patch http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
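The "attempt to load, fall back on class loading exceptions" approach discussed above could look roughly like the sketch below. This is only an illustration under stated assumptions: the {{Tokenizer}} interface, {{DefaultTokenizer}} and the class name passed in {{main}} are hypothetical stand-ins, not the actual Lucene, Carrot2 or {{LuceneLanguageModelFactory}} code.

```java
import java.util.function.Supplier;

/**
 * Sketch: instantiate an optional implementation by reflection and fall back
 * to a default when the class is missing or unusable at runtime.
 */
final class OptionalClassLoading {
    /** Hypothetical stand-in for a tokenizer; not a Lucene or Carrot2 type. */
    interface Tokenizer {
        String name();
    }

    /** Hypothetical default implementation that is always available. */
    static final class DefaultTokenizer implements Tokenizer {
        public String name() {
            return "default";
        }
    }

    /**
     * Tries to create className through its no-argument constructor.
     * Returns the fallback if the class is absent, cannot be linked,
     * cannot be instantiated, or is not a Tokenizer.
     */
    static Tokenizer loadOrFallback(String className, Supplier<Tokenizer> fallback) {
        try {
            Class<?> clazz = Class.forName(className);
            return (Tokenizer) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
            // The optional library is not usable: degrade gracefully instead of failing.
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        // The class name below is made up, so the fallback is used.
        Tokenizer t = loadOrFallback("org.example.MissingChineseTokenizer", DefaultTokenizer::new);
        System.out.println(t.name());
    }
}
```

Catching {{LinkageError}} as well as the reflective exceptions covers the case where the class itself is found but one of its own dependencies is not.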
[jira] Commented: (SOLR-2007) ant get-libraries tries to re-compile solr
[ https://issues.apache.org/jira/browse/SOLR-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890245#action_12890245 ] Stanislaw Osinski commented on SOLR-2007: - Hi, I'm working on upgrading Solr to the latest release of Carrot2, which has only ASL and BSD dependencies, so all the libraries should be fit for inclusion on the SVN. As soon as I have a working patch, I'll attach it to SOLR-1804. Staszek ant get-libraries tries to re-compile solr Key: SOLR-2007 URL: https://issues.apache.org/jira/browse/SOLR-2007 Project: Solr Issue Type: Bug Components: contrib - Clustering Affects Versions: 1.4, 1.4.1 Reporter: Hoss Man Fix For: 3.1, 4.0 as noted on solr-user, if someone downloads a solr distribution and tries to follow the steps for using clustering, the ant get-libraries target of contrib/clustering attempts to recompile all of solr. this seems to be because get-libraries depends on init this really needs to be fixed on both the 3.1 and 4.0 branches before we do any releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2484) Remove deprecated TermAttribute from tokenattributes and legacy support in indexer
[ https://issues.apache.org/jira/browse/LUCENE-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874008#action_12874008 ] Stanislaw Osinski commented on LUCENE-2484: --- Hi! Against which version of Lucene should we refactor/ build Carrot2 to fix the issue? Does it have to be trunk? Thanks! S. Remove deprecated TermAttribute from tokenattributes and legacy support in indexer -- Key: LUCENE-2484 URL: https://issues.apache.org/jira/browse/LUCENE-2484 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.0 Attachments: LUCENE-2484.patch The title says it: - Remove interface TermAttribute - Remove empty fake implementation TermAttributeImpl extends CharTermAttributeImpl - Remove methods from CharTermAttributeImpl (and indirect from Token) - Remove sophisticated® backwards™ Layer in TermsHash* - Remove IAE from NumericTokenStream, if TA is available in AS - Fix rest of core tests (TestToken) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2484) Remove deprecated TermAttribute from tokenattributes and legacy support in indexer
[ https://issues.apache.org/jira/browse/LUCENE-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874024#action_12874024 ] Stanislaw Osinski commented on LUCENE-2484: --- {quote} Since this clustering contrib depends on binary files that are tied to specific versions of the Lucene API, I suggest the following: * only enable clustering in release branches (such as 3x) * when we cut a new release branch from trunk (say we make a 4x), then add the new version there that works with it. * but never have this enabled in trunk, as it is a cyclic dependency {quote} Sounds very good to me, thanks for the explanation! Remove deprecated TermAttribute from tokenattributes and legacy support in indexer -- Key: LUCENE-2484 URL: https://issues.apache.org/jira/browse/LUCENE-2484 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.0 Attachments: LUCENE-2484.patch The title says it: - Remove interface TermAttribute - Remove empty fake implementation TermAttributeImpl extends CharTermAttributeImpl - Remove methods from CharTermAttributeImpl (and indirect from Token) - Remove sophisticated® backwards™ Layer in TermsHash* - Remove IAE from NumericTokenStream, if TA is available in AS - Fix rest of core tests (TestToken) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845441#action_12845441 ]

Stanislaw Osinski commented on SOLR-1804:

Hi Robert,

The Lucene dependency is the only change, right? Or did you also upgrade Carrot2, e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters may have changed, e.g. because we tuned stop words or other algorithm attributes.

S.

Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll

http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary!
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845459#action_12845459 ] Stanislaw Osinski commented on SOLR-1804: - I was about to offer advice similar to Grant's, but wanted to wait to confirm the scope of changes. If it was only Lucene dependency update, with the assumption that the update didn't change the documents fed to Carrot2 in tests, the results shouldn't change. Carrot2 uses Lucene interfaces internally, but the tokenizer is not the standard Lucene one; so no Version.LUCENE_* issues as far as I can tell. I haven't got Solr code handy, but maybe the test performs clustering on summaries generated from the original test documents and Lucene 3.x introduces some changes in the way summaries are generated? If the clusters look reasonable, the problem is probably not critical, but still worth investigation to make sure it's not a bug of some kind. S. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
[ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12845462#action_12845462 ] Stanislaw Osinski commented on SOLR-1804: - Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be distributed together with Solr. S. Upgrade Carrot2 to 3.2.0 Key: SOLR-1804 URL: https://issues.apache.org/jira/browse/SOLR-1804 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll http://project.carrot2.org/release-3.2.0-notes.html Carrot2 is now LGPL free, which means we should be able to bundle the binary! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (SOLR-1809) Carrot2 clustering time logging
[ https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski resolved SOLR-1809. - Resolution: Invalid Hi Erik! You're right, {{debugQuery}} should be enough for most cases. Resolving as invalid. Carrot2 clustering time logging --- Key: SOLR-1809 URL: https://issues.apache.org/jira/browse/SOLR-1809 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 1.5 Attachments: SOLR-1809.patch It may be useful to log the amount of time Carrot2 spent on clustering. This should be helpful when debugging performance issues. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1809) Carrot2 clustering time logging
Carrot2 clustering time logging --- Key: SOLR-1809 URL: https://issues.apache.org/jira/browse/SOLR-1809 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 1.5 It may be useful to log the amount of time Carrot2 spent on clustering. This should be helpful when debugging performance issues. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1809) Carrot2 clustering time logging
[ https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-1809: Attachment: SOLR-1809.patch An initial patch. I'm not sure what Solr's logging policies are, feel free to change the level as appropriate. Carrot2 clustering time logging --- Key: SOLR-1809 URL: https://issues.apache.org/jira/browse/SOLR-1809 Project: Solr Issue Type: Improvement Components: contrib - Clustering Reporter: Stanislaw Osinski Fix For: 1.5 Attachments: SOLR-1809.patch It may be useful to log the amount of time Carrot2 spent on clustering. This should be helpful when debugging performance issues. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
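For illustration, timing a clustering call and logging the elapsed time could be sketched as below. This is not the content of SOLR-1809.patch; the {{ClusteringTimer}} class, the {{Timed}} holder and the log level are assumptions made for the example, and it uses {{java.util.logging}} rather than whatever logging API Solr uses.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Sketch: measure how long a task takes and log it at a low-noise level. */
final class ClusteringTimer {
    private static final Logger LOG = Logger.getLogger(ClusteringTimer.class.getName());

    /** Holds the task result together with the elapsed wall-clock time. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the task, logs the elapsed time under the given label and returns both. */
    static <T> Timed<T> time(String label, Supplier<T> task) {
        long start = System.nanoTime();
        T value = task.get();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // FINE is an arbitrary choice here; the appropriate level depends on project policy.
        LOG.fine(label + " took " + elapsedMillis + " ms");
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        // A trivial stand-in for the real clustering call.
        Timed<Integer> result = time("clustering", () -> 1 + 1);
        System.out.println(result.value + " in " + result.elapsedMillis + " ms");
    }
}
```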
[jira] Commented: (LUCENE-2221) Micro-benchmarks for ntz and pop (BitUtils) operations.
[ https://issues.apache.org/jira/browse/LUCENE-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805034#action_12805034 ]

Stanislaw Osinski commented on LUCENE-2221:

I ran the benchmark on a 64bit Linux running an Intel(R) Xeon(R) E5520 @ 2.27GHz. I tried both Sun's JDK 1.7-ea as well as JDK 1.6.0_18, which also has support for native {{POPCNT}}.

*JDK 1.7-ea {{-server -XX:+UsePopCountInstruction}}*

{noformat}
# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
Benchmark_BitUtil_trunk.test_pop_array: 15/20 rounds, time.total: 7.69, time.warmup: 1.96, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_xor : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_intersect: 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_andnot : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_union: 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 42.30, time.warmup: 12.02, time.bench: 30.29, round: 6.06 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_ntz_iterator_long: 5/7 rounds, time.total: 55.48, time.warmup: 15.43, time.bench: 40.05, round: 8.01 [+- 0.06], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00

# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
Benchmark_BitUtil_pop3264.test_pop_array : 15/20 rounds, time.total: 7.78, time.warmup: 2.05, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_xor: 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_intersect : 15/20 rounds, time.total: 11.14, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_andnot : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_union : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00

# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
Benchmark_BitUtil_popNtzJRE.test_pop_array: 15/20 rounds, time.total: 5.06, time.warmup: 1.29, time.bench: 3.77, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_xor : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_intersect: 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_andnot : 15/20 rounds, time.total: 7.81, time.warmup: 1.99, time.bench: 5.81, round: 0.39 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_union: 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 33.55, time.warmup: 8.72, time.bench: 24.83, round: 4.97 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 39.61, time.warmup: 11.48, time.bench: 28.12, round: 5.62 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00

# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 3.25, time.warmup: 0.82, time.bench: 2.43, round: 0.16 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor : 15/20 rounds, time.total: 5.05,
[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing
[ https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795925#action_12795925 ]

Stanislaw Osinski commented on SOLR-1692:

{quote}
bq. Where should the configuration of the highlighter we use for clustering come from?

We have all the code hooked in for it already, we're just ignoring the output.
{quote}

To avoid confusion and questions along the lines of why clusters don't match the (highlighted) documents I'm seeing, I'd suggest a slightly more elaborate scenario for the clustering highlighter configuration:

1. If main Solr highlighting is disabled, use the clustering component's highlighter settings.
2. If main Solr highlighting is enabled, use the main highlighter's configuration as the defaults and let the clustering-specific highlighter configuration override the defaults.

If we do it this way, we'll minimize the chances of users accidentally performing clustering on documents different (differently highlighted) than those they will see.

bq. Would be great if, Carrot2 could also just use the analysis that Lucene/Solr produces, that way it would be much easier to configure stopwords, HTML stripping, etc.

This one would require some larger changes to Carrot2 internals. We do use Lucene infrastructure for preprocessing (currently for tokenization), but I can investigate if we can extend that further. A potential problem here is that very often the set of stopwords you use for document retrieval may not work equally well for clustering. I've filed a [Carrot2-specific issue|http://issues.carrot2.org/browse/CARROT-606] for it and will try to come up with something.
CarrotClusteringEngine produce summary does nothing --- Key: SOLR-1692 URL: https://issues.apache.org/jira/browse/SOLR-1692 Project: Solr Issue Type: Bug Components: contrib - Clustering Reporter: Grant Ingersoll Assignee: Grant Ingersoll Fix For: 1.5 Attachments: SOLR-1692.patch In the CarrotClusteringEngine, the produceSummary option does nothing, as the results of doing the highlighting are just ignored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
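The two-step precedence suggested in the comment above can be sketched as a plain map merge. This is only an illustration of the proposed rule, not code from SOLR-1692.patch: the {{ClusteringHighlightSettings}} class is hypothetical, and the {{hl.fragsize}} and {{hl.snippets}} keys appear only as example setting names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: compute the highlighter settings that clustering would effectively use. */
final class ClusteringHighlightSettings {
    /**
     * If main highlighting is enabled, its settings act as defaults and the
     * clustering-specific settings override them. If it is disabled, only the
     * clustering-specific settings apply.
     */
    static Map<String, String> effective(boolean mainHighlightingEnabled,
                                         Map<String, String> mainSettings,
                                         Map<String, String> clusteringSettings) {
        Map<String, String> result = new LinkedHashMap<>();
        if (mainHighlightingEnabled) {
            result.putAll(mainSettings);
        }
        // Applied last, so clustering-specific values win over the defaults.
        result.putAll(clusteringSettings);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> main = new LinkedHashMap<>();
        main.put("hl.fragsize", "100");
        main.put("hl.snippets", "3");

        Map<String, String> clustering = new LinkedHashMap<>();
        clustering.put("hl.fragsize", "200");

        System.out.println(effective(true, main, clustering));
        System.out.println(effective(false, main, clustering));
    }
}
```

With main highlighting enabled, the merged result keeps the main highlighter's snippet count but uses the clustering-specific fragment size, which is the behaviour the second step describes.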
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795067#action_12795067 ]

Stanislaw Osinski commented on SOLR-236:

Hi Grant,

{quote}
I would note, in looking at the Carrot2 code, they actually have a ByFieldClusteringAlgorithm (what they call synthetic clustering) which does field collapsing/clustering on a value of a field. To quote the javadocs:

Clusters documents into a flat structure based on the values of some field of the documents. By default the {@link Document#SOURCES} field is used

and

Name of the field to cluster by. Each non-null scalar field value with distinct hash code will give raise to a single cluster, named using the {@link Object#toString()} value of the field. If the field value is a collection, the document will be assigned to all clusters corresponding to the values in the collection. Note that arrays will not be 'unfolded' in this way.

I don't know how it performs, but it seems like it would at least be worth investigating.
{quote}

Carrot2's {{ByFieldClusteringAlgorithm}} is very simple. It literally throws everything into a hash map based on the field value ([source code|http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-algorithm-synthetic/src/org/carrot2/clustering/synthetic/ByFieldClusteringAlgorithm.java?r=trunk#l99]). This algorithm is used in our live demo to [cluster by news source|http://search.carrot2.org/stable/search?source=boss-news&query=iphone&algorithm=source].

{quote}
Note, they also have a synthetic one for collapsing based on URL: ByUrlClusteringAlgorithm
{quote}

This one creates a [hierarchy based on the URL segments|http://search.carrot2.org/stable/search?source=boss-web&query=solr&algorithm=url&results=200] and might be useful to create by-domain collapsing if needed.
In general, my rough guess is that the criteria for content-based collapsing would be closer to duplicate detection than to the type of grouping Carrot2 produces.

Field collapsing Key: SOLR-236 URL: https://issues.apache.org/jira/browse/SOLR-236 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Emmanuel Keller Assignee: Shalin Shekhar Mangar Fix For: 1.5 Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch

This patch includes a new feature called field collapsing, used to collapse a group of results with similar values for a given field into a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site are collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection. http://www.fastsearch.com/glossary.aspx?m=48&amid=299

The implementation adds 3 new query parameters (SolrParams):
- collapse.field to choose the field used to group results
- collapse.type normal (default value) or adjacent
- collapse.max to select how many continuous results are allowed before collapsing

TODO (in progress):
- More documentation (on source code)
- Test cases

Two patches:
- field_collapsing.patch for current development version
- field_collapsing_1.1.0.patch for Solr-1.1.0

P.S.: Feedback and misspelling correction are welcome ;-)
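The hash-map grouping described in the SOLR-236 comment above can be sketched roughly as follows. This is only an illustration of the quoted javadoc rules (skip null values, one group per distinct scalar value, one group per element of a collection value, arrays not unfolded). It is not Carrot2's actual {{ByFieldClusteringAlgorithm}} code; the class name, the method name and the modelling of documents as plain maps are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the Carrot2 implementation. */
class ByFieldGroupingSketch {

    /**
     * Groups documents (modelled here as plain maps) by the value of one field.
     * The key of each group is the field value; its toString() would serve as the label.
     */
    static Map<Object, List<Map<String, Object>>> groupByField(
            List<Map<String, Object>> documents, String fieldName) {
        Map<Object, List<Map<String, Object>>> groups = new LinkedHashMap<>();
        for (Map<String, Object> doc : documents) {
            Object value = doc.get(fieldName);
            if (value == null) {
                continue; // documents without the field join no group
            }
            if (value instanceof Collection) {
                // A collection value assigns the document to one group per element.
                for (Object element : (Collection<?>) value) {
                    if (element != null) {
                        groups.computeIfAbsent(element, k -> new ArrayList<>()).add(doc);
                    }
                }
            } else {
                // Scalars (and arrays, which are not unfolded) form a single group.
                groups.computeIfAbsent(value, k -> new ArrayList<>()).add(doc);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new HashMap<>();
        first.put("source", "bbc");
        Map<String, Object> second = new HashMap<>();
        second.put("source", Arrays.asList("bbc", "cnn"));
        groupByField(Arrays.asList(first, second), "source")
                .forEach((value, docs) -> System.out.println(value + ": " + docs.size()));
    }
}
```

Keeping the raw field value as the map key mirrors the "distinct hash code" wording of the javadoc; a real implementation would still need to decide how to label and order the resulting groups.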
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760238#action_12760238 ]

Stanislaw Osinski commented on SOLR-1314:

The required change is right at the end of the big diff:

{noformat}
Index: contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java
===================================================================
--- contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (revision 819270)
+++ contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (working copy)
@@ -40,11 +40,11 @@
 @SuppressWarnings("unchecked")
 public class CarrotClusteringEngineTest extends AbstractClusteringTest {
   public void testCarrotLingo() throws Exception {
-    checkEngine(getClusteringEngine("default"), 9);
+    checkEngine(getClusteringEngine("default"), 10);
   }
 
   public void testCarrotStc() throws Exception {
-    checkEngine(getClusteringEngine("stc"), 2);
+    checkEngine(getClusteringEngine("stc"), 1);
   }
 
   public void testWithoutSubclusters() throws Exception {
{noformat}

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4 Attachments: SOLR-1314.patch

As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes in clustering algorithms and improved clustering in Chinese. The upgrade should be a matter of upgrading {{carrot2-mini.jar}} and {{google-collections.jar}}.
[jira] Updated: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-1314:

Attachment: SOLR-1314.patch

Hi Grant, I've built Carrot2 3.1.0 binaries and tested them with Solr trunk. Attached is a patch that upgrades the libs to Carrot2 3.1.0 and fixes one unit test.

S.

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4 Attachments: SOLR-1314.patch
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759667#action_12759667 ]

Stanislaw Osinski commented on SOLR-1314:

Hi Grant,

bq. Now that Lucene is final, can we finalize the jar for this one?

Sure, over the weekend we'll be making an official Carrot2 3.1.0 release. As part of that process I'll check if the Solr plugin is working fine and will post the final JAR here.

bq. Also, this final JAR will handle the license and FastVector stuff, right?

Correct. The following commit removed it from trunk and hence the 3.1.0 release: http://fisheye3.atlassian.com/changelog/carrot2/?cs=3694

S.

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758843#action_12758843 ]

Stanislaw Osinski commented on SOLR-1314:

Hi Grant, I've made Carrot2's dependency on Smart Chinese Analyzer optional, so no exceptions should be thrown when the big JAR is not in the classpath. As usual, download from here: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

S.

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756110#action_12756110 ]

Stanislaw Osinski commented on SOLR-1314:

Hi Grant, I've just dropped the patenting clause entirely. The updated license is in the repo and at: http://www.carrot2.org/carrot2.LICENSE.

S.

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4
[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer
[ https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756177#action_12756177 ]

Stanislaw Osinski commented on SOLR-1336:

Keeping the Chinese analyzer JAR optional sounds good. As Carrot2 also uses it, I'd need to make sure the clustering contrib doesn't fail when the JAR is not there and clustering in Chinese is requested (I think I'd simply log a WARN saying that the Chinese analyzer JAR is required for best clustering results).

Add support for lucene's SmartChineseAnalyzer Key: SOLR-1336 URL: https://issues.apache.org/jira/browse/SOLR-1336 Project: Solr Issue Type: New Feature Components: Analysis Reporter: Robert Muir Attachments: SOLR-1336.patch, SOLR-1336.patch, SOLR-1336.patch

SmartChineseAnalyzer was contributed to Lucene; it indexes Simplified Chinese text as words. If the factories for the tokenizer and word token filter are added to Solr it can be used, although there should be a sample config or wiki entry showing how to apply the built-in stopwords list. This is because it doesn't contain actual stopwords, but must be used to prevent indexing punctuation...

Note: we did some refactoring/cleanup on this analyzer recently, so it would be much easier to do this after the next Lucene update. It has also been moved out of -analyzers.jar due to size, and now builds in its own smartcn jar file, so that would need to be added if this feature is desired.
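The "log a WARN and carry on" behaviour proposed in the SOLR-1336 comment above could look roughly like the sketch below. It is a minimal illustration of probing for an optional class at runtime, not the code that went into Carrot2 or the clustering contrib. The analyzer class name is an assumption about where the smartcn analyzer lives, and the tokenizer names returned by {{chooseTokenizer}} are purely hypothetical.

```java
import java.util.logging.Logger;

/** Hypothetical sketch of degrading gracefully when an optional JAR is missing. */
class OptionalAnalyzerSketch {

    private static final Logger LOG =
            Logger.getLogger(OptionalAnalyzerSketch.class.getName());

    // Assumed fully qualified name of the optional analyzer class.
    static final String SMART_CN_ANALYZER =
            "org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer";

    /** Returns true if the named class can be found on the classpath. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only probe for presence, run no static initializers
            Class.forName(className, false, OptionalAnalyzerSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Picks a tokenizer name, falling back with a warning if the optional JAR is absent. */
    static String chooseTokenizer(String language) {
        if (!"zh".equals(language)) {
            return "default";
        }
        if (isClassAvailable(SMART_CN_ANALYZER)) {
            return "smartcn";
        }
        LOG.warning("Chinese analyzer JAR not found; it is required for best "
                + "clustering results in Chinese. Falling back to the default tokenizer.");
        return "default";
    }

    public static void main(String[] args) {
        System.out.println(chooseTokenizer("zh"));
    }
}
```

Probing with {{Class.forName}} keeps the dependency out of the compile-time classpath, which is the point of making the JAR optional.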
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755657#action_12755657 ]

Stanislaw Osinski commented on SOLR-1314:

As a follow-up of the discussion on legal-discuss, I've removed the dependency on {{FastVector}} from Carrot2's STC algorithm. The binaries are in the usual place: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754699#action_12754699 ]

Stanislaw Osinski commented on SOLR-1314:

Good point, Grant. Though the classes we included are merely definitions of native methods, it's better to keep them separate. I've just reverted back to a separate {{nni.jar}}, binaries are here: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754593#action_12754593 ]

Stanislaw Osinski commented on SOLR-1314:

Let me build C2 with Lucene 2.9 RC4, will post a download URL in a while.

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4
[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
[ https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754597#action_12754597 ]

Stanislaw Osinski commented on SOLR-1314:

Hi Grant, Here's Carrot2 3.1-dev built with Lucene 2.9-rc4: http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

Please note a few things about the dependencies:
* {{nni.jar}} is now part of {{carrot2-mini.jar}}, so no need to download it separately
* dependencies upgraded to the newer versions (http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/carrot2-mini-3.1-dev.pom), Lucene entry in the POM still needs to be upgraded for version 2.9
* Carrot2 provides experimental support for Chinese Simplified based on the smart cn analyzer -- does Solr distribute that JAR by default?

Please let me know if you have any problems upgrading.

S.

Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Assignee: Grant Ingersoll Fix For: 1.4
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736030#action_12736030 ]

Stanislaw Osinski commented on SOLR-769:

Hi Grant,

There's one more thing: we're planning to release version 3.1.0 of Carrot2 with certain bug fixes in the clustering algorithms and better support for Chinese (using the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out, but before Solr 1.4, so that the latter would have a newer version of Carrot2 on board (should be just a matter of replacing Carrot2 JAR / upgrading version of the downloaded dependency). Would that make sense? Should I create a separate issue for it, or rather reopen this one?

Thanks, S.

Support Document and Search Result clustering Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch

Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering.

The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results.

While not fully fleshed out yet, the collection based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism.
[jira] Created: (SOLR-1314) Upgrade Carrot2 to version 3.1.0
Upgrade Carrot2 to version 3.1.0 Key: SOLR-1314 URL: https://issues.apache.org/jira/browse/SOLR-1314 Project: Solr Issue Type: Task Reporter: Stanislaw Osinski Fix For: 1.4

As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes in clustering algorithms and improved clustering in Chinese. The upgrade should be a matter of upgrading {{carrot2-mini.jar}} and {{google-collections.jar}}.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736039#action_12736039 ]

Stanislaw Osinski commented on SOLR-769:

Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.

Support Document and Search Result clustering Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislaw Osinski updated SOLR-769:

Attachment: subcluster-flattening.patch

Hi,

While configuring the clustering component for an algorithm that returns hierarchical clusters, it took me a while to debug why subclusters wouldn't appear on the output. It turned out that the default value for the {{carrot.outputSubClusters}} parameter is {{false}}, which was the opposite to what I assumed :-) Would it be a problem to change the default to {{true}}, so that other users avoid the same problem?

Another improvement worth making for the {{carrot.outputSubClusters}} = {{false}} case is flattening the clusters: returning all documents of the 1st level clusters, including those contained in the subclusters the user chose not to output. Without this improvement, many document-cluster assignments may be lost because some Carrot2 algorithms will assign documents only to the leaf (deepest in the hierarchy) clusters.

I'm attaching a patch that implements both changes.

Support Document and Search Result clustering Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Yonik Seeley Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch
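The flattening idea from the subcluster-flattening comment above can be sketched as follows. This is an illustration of the behaviour described there (each first-level cluster reports its own documents plus every document found in its subclusters), not the attached patch. The {{Cluster}} class is a hypothetical stand-in and does not mirror Carrot2's own cluster type.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of flattening a cluster hierarchy to its first level. */
class SubclusterFlatteningSketch {

    /** Minimal stand-in for a hierarchical cluster. */
    static final class Cluster {
        final String label;
        final List<String> documentIds = new ArrayList<>();
        final List<Cluster> subclusters = new ArrayList<>();

        Cluster(String label) {
            this.label = label;
        }
    }

    /** Collects the ids of a cluster and, recursively, of all its subclusters. */
    static Set<String> allDocumentIds(Cluster cluster) {
        // LinkedHashSet drops duplicates while keeping first-seen order
        Set<String> ids = new LinkedHashSet<>(cluster.documentIds);
        for (Cluster sub : cluster.subclusters) {
            ids.addAll(allDocumentIds(sub));
        }
        return ids;
    }

    /** Returns copies of the first-level clusters with subcluster documents merged in. */
    static List<Cluster> flatten(List<Cluster> firstLevel) {
        List<Cluster> flat = new ArrayList<>();
        for (Cluster cluster : firstLevel) {
            Cluster copy = new Cluster(cluster.label);
            copy.documentIds.addAll(allDocumentIds(cluster));
            flat.add(copy); // the copy deliberately has no subclusters
        }
        return flat;
    }

    public static void main(String[] args) {
        Cluster parent = new Cluster("Apple");
        Cluster leaf = new Cluster("iPhone");
        leaf.documentIds.add("d1");
        parent.subclusters.add(leaf);
        List<Cluster> flat = new ArrayList<>();
        flat.add(parent);
        System.out.println(flatten(flat).get(0).documentIds);
    }
}
```

Without the merge step, a parent whose documents all sit in leaf clusters would come back empty once subclusters are suppressed, which is the loss of assignments the comment warns about.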
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725739#action_12725739 ]

Stanislaw Osinski commented on SOLR-769:

bq. Is labels is needed because there could be multiple labels per cluster in the future? ( I assume yes)

Correct. Currently neither of Carrot2's algorithms creates clusters with multiple labels, but it's quite likely that there are other algorithms that can do that.

Support Document and Search Result clustering Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Yonik Seeley Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712534#action_12712534 ]

Stanislaw Osinski commented on SOLR-769:

In fact, you can set Carrot2 attributes (both init- and request-time) in the Solr config file; this should also work without the patch. Just add: {{<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>}} to the search component element. See http://wiki.apache.org/solr/ClusteringComponent for an example. You'll find a list of Carrot2 attributes, their ids and descriptions at: http://download.carrot2.org/stable/manual/#chapter.components.

Support Document and Search Result clustering Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712545#action_12712545 ]

Stanislaw Osinski commented on SOLR-769:

Ah, I should have mentioned that up front -- Carrot2 will try to convert the string into the type accepted by the attribute. In the case of class-typed attributes, it will try to load the class using the current thread's context classloader. Conversions are also available for numeric, boolean and enum attributes (see: http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html). Please let me know if that way works for you.

Support Document and Search Result clustering Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
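The string-to-type conversion described in the comment above (class names loaded through the current thread's context classloader, plus numeric, boolean and enum conversions) can be sketched as below. This is only an illustration of the general technique. It does not reproduce Carrot2's {{AttributeBinder}}; the class name, the {{convert}} method and the set of supported target types are assumptions made for the example.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of converting config strings to typed attribute values. */
class AttributeConversionSketch {

    /** Converts a string taken from a config file to the requested target type. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Object convert(String value, Class<?> targetType) {
        if (targetType == String.class) {
            return value;
        }
        if (targetType == Integer.class || targetType == int.class) {
            return Integer.valueOf(value);
        }
        if (targetType == Double.class || targetType == double.class) {
            return Double.valueOf(value);
        }
        if (targetType == Boolean.class || targetType == boolean.class) {
            return Boolean.valueOf(value);
        }
        if (targetType.isEnum()) {
            // raw cast needed because the enum type is only known at runtime
            return Enum.valueOf((Class) targetType, value);
        }
        if (targetType == Class.class) {
            // class-typed attributes: resolve the name via the context classloader
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            try {
                return Class.forName(value, true, loader);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Cannot load class: " + value, e);
            }
        }
        throw new IllegalArgumentException("Unsupported attribute type: " + targetType);
    }

    public static void main(String[] args) {
        System.out.println(convert("42", Integer.class));
        System.out.println(convert("SECONDS", TimeUnit.class));
        System.out.println(convert("java.util.ArrayList", Class.class));
    }
}
```

Using the context classloader rather than the library's own loader matters in a container such as Solr, where plugin classes are often visible only to a child classloader.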
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712421#action_12712421 ] Stanislaw Osinski commented on SOLR-769: Pasting the comment I made on the list: The catch with the analyzer is that this specific attribute is an initialization-time attribute, so you need to add it to the {{initAttributes}} map in the {{init()}} method of {{CarrotClusteringEngine}}. Please let me know if this solves the problem. If not, I'll investigate further.
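To illustrate the point about initialization-time attributes, here is a minimal sketch of an engine that routes some configured values into an init-time map once, in init(), and passes only the rest along per request. It is not the actual CarrotClusteringEngine; the class InitAttributesSketch, its methods, and the attribute keys used in the usage note are all hypothetical. The thread does not say how Carrot2 treats an init-time attribute supplied at request time, so the sketch simply leaves such keys out of the request map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for an engine with init-time and per-request attributes. */
final class InitAttributesSketch {
    // Assumed set of keys that are only read when the engine is initialized.
    private final Set<String> initTimeKeys;
    private final Map<String, Object> initAttributes = new HashMap<>();

    InitAttributesSketch(Set<String> initTimeKeys) {
        this.initTimeKeys = initTimeKeys;
    }

    /** Called once at startup: copy init-time settings into initAttributes. */
    void init(Map<String, String> config) {
        for (Map.Entry<String, String> e : config.entrySet()) {
            if (initTimeKeys.contains(e.getKey())) {
                initAttributes.put(e.getKey(), e.getValue());
            }
        }
    }

    /** Called per request: only non-init-time keys are passed along. */
    Map<String, Object> requestAttributes(Map<String, String> params) {
        Map<String, Object> attrs = new HashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!initTimeKeys.contains(e.getKey())) {
                attrs.put(e.getKey(), e.getValue());
            }
        }
        return attrs;
    }

    Map<String, Object> getInitAttributes() {
        return Collections.unmodifiableMap(initAttributes);
    }
}
```

For example, with "analyzer" declared as an init-time key, a configuration containing both "analyzer" and a made-up "desiredClusterCount" entry ends up with the former in getInitAttributes() and the latter in requestAttributes(...).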
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710087#action_12710087 ] Stanislaw Osinski commented on SOLR-769: Thanks Grant! Looking forward to seeing the code in the repo! S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695463#action_12695463 ] Stanislaw Osinski commented on SOLR-769: Hi Grant, If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip, you'll find licenses in the lib/ folder of the distribution. That distribution contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar), so you'd need to pick only those that are relevant. S.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688171#action_12688171 ]
Stanislaw Osinski commented on SOLR-769:
bq. Also, you say C2 can handle full docs, is it feasible, then to implement it for the offline mode I have in mind, whereby you cluster the whole collection offline and then store the clusters for retrieval? I haven't implemented this yet, but was thinking some people will be interested in full corpus clustering. The nice thing, then, is that as new documents come in, they can be added to existing clusters (and maybe periodically, we re-cluster). Just thinking outloud.
We have two variables here: the length of docs and the number of docs. Carrot2 is suitable for small numbers of docs (up to say 1000). If the docs are short (a paragraph or so), the clustering should be pretty fast, suitable for on-line processing (see: http://project.carrot2.org/algorithms.html). If the documents get longer, Carrot2 will still handle them, but will require some more time for processing; I'll try to do some measurements. But C2 is not useful for the whole collection case -- it performs all processing in-memory and here we'd need a totally different class of algorithm, something along the lines of Mahout developments.
bq. Hmm, that's an interesting thought. We could check to see if highlighting is done first.
To quickly summarise the pros and cons of relying on highlighting being done outside of the clustering component:
Pros:
* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration
Cons:
* if someone doesn't want highlighting in the search results, the clustering is likely to take more time (because it operates on full documents, and it's controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which may affect clustering (I'd need to check how Carrot2 handles that)
bq. Should the MockClusteringAlgorithm be under the test source tree and not the main one? I moved it in the patch to follow
Absolutely, it should be in the test source.
bq. I don't think we need to output the number of clusters, since that will be obvious from the list size. I dropped it in the patch to follow
Makes sense, I kept it because the original version had it.
bq. Also, on the response structure, we certainly could make it optional, although it means having to go do a lookup in the real doc list, which could be less than fun.
By lookup you mean the lookup in the XML response? Here again we have a trade-off between the length of the response and ease of processing: if we repeat document titles / snippets in the clusters structure, we at least double the response size ("at least" because the same document may belong to many clusters), but can potentially save some lookups. But if we want to get some other fields of a document (other than those we repeat in the clusters list), we'd still need a lookup. To sum up, my intuition would be to avoid duplication and stick with document ids in the cluster list (this is what we do in Carrot2 XMLs as well). Optionally, the clustering component could have a list of configurable fields to be repeated in the cluster list if that's really helpful in real-world use cases.
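The lookup discussed in the comment above, where clusters carry only document ids and the client resolves them against the main document list, could look roughly like this on the client side. This is a sketch under assumptions: documents are modelled as plain maps, and the class ClusterDocLookup and the field names "id" and "name" are hypothetical rather than part of Solr or Carrot2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side helper for resolving cluster doc ids. */
final class ClusterDocLookup {
    private ClusterDocLookup() {}

    /** Indexes the main document list by its unique id field, once per response. */
    static Map<String, Map<String, Object>> indexById(
            List<Map<String, Object>> docs, String idField) {
        Map<String, Map<String, Object>> byId = new HashMap<>();
        for (Map<String, Object> doc : docs) {
            byId.put(String.valueOf(doc.get(idField)), doc);
        }
        return byId;
    }

    /**
     * Resolves the ids listed in one cluster to the values of any field of the
     * referenced documents. Ids without a matching document are skipped.
     */
    static List<Object> fieldValues(
            List<String> clusterDocIds,
            Map<String, Map<String, Object>> byId,
            String field) {
        List<Object> values = new ArrayList<>();
        for (String id : clusterDocIds) {
            Map<String, Object> doc = byId.get(id);
            if (doc != null && doc.containsKey(field)) {
                values.add(doc.get(field));
            }
        }
        return values;
    }
}
```

Because the index is built once, each cluster lookup is a map access per id, and any field of the document is reachable without repeating it in the clusters section.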
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-769: --- Attachment: (was: SOLR-769.patch)
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-769: --- Attachment: SOLR-769.zip Further code clean-ups, support for passing initialization-time attributes to Carrot2 algorithms, some comments in the example configuration file.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-769: --- Attachment: (was: SOLR-769-lib.zip)
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-769: --- Attachment: SOLR-769-lib.zip Libs with Carrot2 v3.0.1 we've just released.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated SOLR-769: --- Attachment: SOLR-769-lib.zip SOLR-769.patch Yet another patch, this time with passing unit tests and working example. Will make some more comments in a sec. Please use SOLR-769-lib.zip libs with this patch.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942 ]
Stanislaw Osinski commented on SOLR-769:
Hi All, I've just uploaded a patch that passes unit tests and has a working example, but this is by no means a final version. A few outstanding questions / issues:
# h4. Response structure.
I was wondering -- do we need to repeat the document contents in the 'clusters' response section? Assuming that each document in the index has a unique ID, we could reduce the size of the response by just referencing documents by IDs like this: \\
{code}
<lst name="clusters">
  <int name="numClusters">3</int>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">GPU VPU Clocked</str>
    </lst>
    <lst name="docs">
      <str name="doc">EN7800GTX/2DHTV/256M</str>
      <str name="doc">100-435805</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Hard Drive</str>
    </lst>
    <lst name="docs">
      <str name="doc">6H500F0</str>
      <str name="doc">SP2514N</str>
    </lst>
  </lst>
  <lst name="cluster">
    <lst name="labels">
      <str name="label">Other Topics</str>
    </lst>
    <lst name="docs">
      <str name="doc">9885A004</str>
    </lst>
  </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch. Also, in case of hierarchical clusters I've introduced a grouping entity called clusters so that the top- and sub-levels of the response are consistent (see unit tests). Please let me know if this makes sense.
# h4. Build: compile warnings about missing SimpleXML
SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not needed at runtime, but generates warnings about missing dependencies during compile time. So the option is either to live with the warnings or to add SimpleXML (version 1.7.2) to get rid of the warnings.
# h4. Build: copying of protowords.txt etc
The patch includes lexical files both in the contrib/clustering/src/java/test/resources/ and in the examples dir. I'm not sure how this is handled though -- do you keep copies in the repository or copy those somehow in the build?
# h4. Highlighting
This is the bit I've not yet fully analyzed. In general, Carrot2 should handle full documents fairly well (up to say a few hundred kB each), it's just the number of documents that must be on the order of hundreds. Therefore, highlighting is not mandatory, but it may sometimes improve the quality of clusters. I was wondering, if highlighting is performed earlier in the Solr pipeline, could this be reused during clustering? One possible approach could be that clustering uses whatever is fed from the pipeline: if highlighting is enabled, clustering will be performed on the highlighted content; if there was no highlighting, we'd cluster full documents. Not sure if that's reasonable / possible to implement though.
# h4. Documentation (wiki) updates
Once we stabilise the ideas, I'm happy to update the wiki with regard to the algorithms used (Lingo/STC) and passing additional parameters.
Support Document and Search Result clustering
Key: SOLR-769
URL: https://issues.apache.org/jira/browse/SOLR-769
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: clustering-libs.tar, clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch
Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering. The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster.
While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results. While not fully fleshed out yet, the collection based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out,