[jira] [Created] (SOLR-4253) Misleading resource loading warning from Carrot2 clustering component

2013-01-02 Thread Stanislaw Osinski (JIRA)
Stanislaw Osinski created SOLR-4253:
---

 Summary: Misleading resource loading warning from Carrot2 
clustering component
 Key: SOLR-4253
 URL: https://issues.apache.org/jira/browse/SOLR-4253
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 4.0
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.1


{{SolrResourceLoader.openResource(String)}} now throws only {{IOException}}, 
which causes the clustering component to issue resource loading warnings even 
if the fallback resources from the Carrot2 JAR are available.
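
The intended behaviour can be sketched roughly as follows: try the primary location first, fall back to the bundled resources, and warn only when both lookups fail. This is a minimal illustration, not the committed fix. The {{Opener}} interface, the {{warnings}} list and the {{stopwords.en}} resource name are assumptions made up for the example; the only detail taken from the report is that a failed lookup surfaces as an {{IOException}}.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

final class FallbackResourceLookup {

    /** Hypothetical stand-in for a loader such as SolrResourceLoader. */
    interface Opener {
        InputStream open(String name) throws IOException;
    }

    /** Warnings are collected in a list so the sketch needs no logging library. */
    static final List<String> warnings = new ArrayList<>();

    /** Returns the first stream that can be opened, or null if both lookups fail. */
    static InputStream open(String name, Opener primary, Opener fallback) {
        try {
            return primary.open(name);
        } catch (IOException e) {
            // Expected when the resource is not overridden; not worth a warning yet.
        }
        try {
            return fallback.open(name);
        } catch (IOException e) {
            // Only now is the resource really missing.
            warnings.add("Resource not found in any location: " + name);
            return null;
        }
    }

    public static void main(String[] args) {
        // Simulated loaders: the config dir has no override, the JAR has the resource.
        Opener configDir = name -> { throw new IOException("not in config dir: " + name); };
        Opener carrot2Jar = name -> new ByteArrayInputStream(new byte[0]);
        InputStream in = open("stopwords.en", configDir, carrot2Jar);
        System.out.println("opened=" + (in != null) + ", warnings=" + warnings.size());
    }
}
```

The point of the design is that the first {{IOException}} is treated as a normal "not overridden" signal rather than as a problem to report.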

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-4253) Misleading resource loading warning from Carrot2 clustering component

2013-01-02 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-4253.
-

Resolution: Fixed

Fixed in trunk and 4.x branch.

 Misleading resource loading warning from Carrot2 clustering component
 -

 Key: SOLR-4253
 URL: https://issues.apache.org/jira/browse/SOLR-4253
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 4.0
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.1


 {{SolrResourceLoader.openResource(String)}} now throws only {{IOException}}, 
 which causes the clustering component to issue resource loading warnings even 
 if the fallback resources from the Carrot2 JAR are available.




[jira] [Resolved] (SOLR-3279) Upgrade Carrot2 to minimize the possibility of dependency clashes

2013-01-02 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-3279.
-

Resolution: Fixed

Carrot2 upgraded to 3.6.2 in trunk and 4.x branch.

NB: Carrot2 3.6.2 stock binaries ship with Guava r12, but r13 (currently in 
Solr) is backwards compatible. If an upgrade to r14 is needed at some point, it 
will most likely be possible without upgrading Carrot2.

 Upgrade Carrot2 to minimize the possibility of dependency clashes
 -

 Key: SOLR-3279
 URL: https://issues.apache.org/jira/browse/SOLR-3279
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.1


 When we get closer to the 4.0 release, update Carrot2 to the then newest 
 version so that the dependencies get a refresh too (re: 
 http://lucene.472066.n3.nabble.com/Old-Google-Guava-library-needs-updating-r05-td3854433.html).




[jira] [Commented] (SOLR-3279) Upgrade Carrot2 to minimize the possibility of dependency clashes

2013-01-01 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541640#comment-13541640
 ] 

Stanislaw Osinski commented on SOLR-3279:
-

It's high time we upgraded; I'll take a look at this tomorrow.

 Upgrade Carrot2 to minimize the possibility of dependency clashes
 -

 Key: SOLR-3279
 URL: https://issues.apache.org/jira/browse/SOLR-3279
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.1


 When we get closer to the 4.0 release, update Carrot2 to the then newest 
 version so that the dependencies get a refresh too (re: 
 http://lucene.472066.n3.nabble.com/Old-Google-Guava-library-needs-updating-r05-td3854433.html).




[jira] [Commented] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

2012-05-21 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280023#comment-13280023
 ] 

Stanislaw Osinski commented on SOLR-3470:
-

Not pretty indeed, but still better than hardcoding Carrot2 attribute names. 
I'll commit this in a moment.

 Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
 --

 Key: SOLR-3470
 URL: https://issues.apache.org/jira/browse/SOLR-3470
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: SOLR-3470.patch







[jira] [Resolved] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

2012-05-21 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-3470.
-

Resolution: Fixed

Dawid's patch committed to trunk and 3.6 branch.

 Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
 --

 Key: SOLR-3470
 URL: https://issues.apache.org/jira/browse/SOLR-3470
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: SOLR-3470.patch







[jira] [Created] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

2012-05-20 Thread Stanislaw Osinski (JIRA)
Stanislaw Osinski created SOLR-3470:
---

 Summary: Custom Carrot2 tokenizer and stemmer factories 
overwritten by defaults
 Key: SOLR-3470
 URL: https://issues.apache.org/jira/browse/SOLR-3470
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.6.1







[jira] [Updated] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

2012-05-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-3470:


Fix Version/s: 4.0

 Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
 --

 Key: SOLR-3470
 URL: https://issues.apache.org/jira/browse/SOLR-3470
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.0, 3.6.1







[jira] [Resolved] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

2012-05-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-3470.
-

Resolution: Fixed

Fixed in trunk and 3.6.1 branch.

 Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
 --

 Key: SOLR-3470
 URL: https://issues.apache.org/jira/browse/SOLR-3470
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.0, 3.6.1







[jira] [Reopened] (SOLR-3470) Custom Carrot2 tokenizer and stemmer factories overwritten by defaults

2012-05-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-3470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski reopened SOLR-3470:
-


Unit tests pass fine, but Carrot2's internal class resolution code (context 
class loader) doesn't play well with how Solr loads contrib classes in webapp 
mode.

A brute-force fix would be to do the class loading the Solr way in the 
clustering component and pass class objects instead of strings to Carrot2.
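
A rough sketch of that brute-force approach: resolve the configured class names with a loader the caller controls, then hand over {{Class}} objects, so nothing downstream needs the thread context class loader. The attribute key and the helper below are hypothetical, and a plain JDK class stands in for a custom factory; this is not the code from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ClassAttributeResolver {

    /**
     * Loads each configured class name with the supplied loader and returns a map
     * of attribute key to Class object, so the receiving library never has to
     * resolve names through the thread context class loader.
     */
    static Map<String, Object> resolve(Map<String, String> classNames, ClassLoader loader) {
        Map<String, Object> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : classNames.entrySet()) {
            try {
                // initialize=false: only locate the class, do not run static initializers.
                resolved.put(entry.getKey(), Class.forName(entry.getValue(), false, loader));
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException(
                        "Cannot load " + entry.getValue() + " for " + entry.getKey(), e);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> configured = new LinkedHashMap<>();
        // Hypothetical attribute key; java.util.ArrayList stands in for a custom factory class.
        configured.put("tokenizerFactory", "java.util.ArrayList");
        System.out.println(resolve(configured, ClassAttributeResolver.class.getClassLoader()));
    }
}
```

In the clustering component the loader argument would presumably be whatever class loader Solr uses for contrib classes; that wiring is left out here.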

 Custom Carrot2 tokenizer and stemmer factories overwritten by defaults
 --

 Key: SOLR-3470
 URL: https://issues.apache.org/jira/browse/SOLR-3470
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.6
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 4.0, 3.6.1







[jira] [Created] (SOLR-2706) The carrot.lexicalResourcesDir parameter does not work with absolute directories

2011-08-11 Thread Stanislaw Osinski (JIRA)
The carrot.lexicalResourcesDir parameter does not work with absolute directories


 Key: SOLR-2706
 URL: https://issues.apache.org/jira/browse/SOLR-2706
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.3, 3.2
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.4, 4.0







[jira] [Commented] (SOLR-1692) CarrotClusteringEngine produce summary does nothing

2011-08-03 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078632#comment-13078632
 ] 

Stanislaw Osinski commented on SOLR-1692:
-

Looking at the code, the issue is already resolved: summaries (from the 
highlighter) are used for clustering when configured. There's no unit test for 
the feature though, so I can write one and resolve the issue.

 CarrotClusteringEngine produce summary does nothing
 ---

 Key: SOLR-1692
 URL: https://issues.apache.org/jira/browse/SOLR-1692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Grant Ingersoll
 Fix For: 3.4, 4.0

 Attachments: SOLR-1692.patch


 In the CarrotClusteringEngine, the produceSummary option does nothing, as the 
 results of doing the highlighting are just ignored.




[jira] [Assigned] (SOLR-1692) CarrotClusteringEngine produce summary does nothing

2011-08-03 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski reassigned SOLR-1692:
---

Assignee: Stanislaw Osinski

 CarrotClusteringEngine produce summary does nothing
 ---

 Key: SOLR-1692
 URL: https://issues.apache.org/jira/browse/SOLR-1692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Stanislaw Osinski
 Fix For: 3.4, 4.0

 Attachments: SOLR-1692.patch


 In the CarrotClusteringEngine, the produceSummary option does nothing, as the 
 results of doing the highlighting are just ignored.




[jira] [Created] (SOLR-2692) Typo in clustering fragment size param name

2011-08-03 Thread Stanislaw Osinski (JIRA)
Typo in clustering fragment size param name
---

 Key: SOLR-2692
 URL: https://issues.apache.org/jira/browse/SOLR-2692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.3
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.1.1, 3.4


The param should be {{carrot.fragSize}} but it's {{carrot.fragzise}}.




[jira] [Updated] (SOLR-2692) Typo in clustering fragment size param name

2011-08-03 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2692:


Fix Version/s: (was: 3.1.1)

I mistook 3.1.1 for 3.3.1.

 Typo in clustering fragment size param name
 ---

 Key: SOLR-2692
 URL: https://issues.apache.org/jira/browse/SOLR-2692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.3
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.4


 The param should be {{carrot.fragSize}} but it's {{carrot.fragzise}}.




[jira] [Resolved] (SOLR-1692) CarrotClusteringEngine produce summary does nothing

2011-08-03 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-1692.
-

   Resolution: Fixed
Fix Version/s: (was: 3.4)
   (was: 4.0)
   3.1

This issue was really fixed for 3.1.0 and documented in CHANGES under that 
release. It doesn't make sense to complicate things further as I suggested in 
the discussion above, so resolving.

 CarrotClusteringEngine produce summary does nothing
 ---

 Key: SOLR-1692
 URL: https://issues.apache.org/jira/browse/SOLR-1692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Stanislaw Osinski
 Fix For: 3.1

 Attachments: SOLR-1692.patch


 In the CarrotClusteringEngine, the produceSummary option does nothing, as the 
 results of doing the highlighting are just ignored.




[jira] [Resolved] (SOLR-2692) Typo in clustering fragment size param name

2011-08-03 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2692.
-

Resolution: Fixed

Fixed in trunk and branch_3x; the ClusteringComponent wiki has been updated to 
warn users about this bug.

 Typo in clustering fragment size param name
 ---

 Key: SOLR-2692
 URL: https://issues.apache.org/jira/browse/SOLR-2692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 3.3
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.4


 The param should be {{carrot.fragSize}} but it's {{carrot.fragzise}}.




[jira] [Created] (SOLR-2561) SimpleXML notice is a copy of mahout-math notice

2011-05-31 Thread Stanislaw Osinski (JIRA)
SimpleXML notice is a copy of mahout-math notice


 Key: SOLR-2561
 URL: https://issues.apache.org/jira/browse/SOLR-2561
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Trivial
 Fix For: 3.3, 4.0


The note should probably say something like:

This product includes software developed by
the SimpleXML project (http://simple.sourceforge.net/).





[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033919#comment-13033919
 ] 

Stanislaw Osinski commented on SOLR-2448:
-

Hi, if there are no objections, I'd like to commit this patch later today. 
Thanks! S.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, 
 SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Resolved] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2450.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Carrot2 clustering should use both its own and Solr's stop words
 

 Key: SOLR-2450
 URL: https://issues.apache.org/jira/browse/SOLR-2450
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2450.patch


 While using only Solr's stop words for clustering isn't a good idea (compared 
 to indexing, clustering needs more aggressive stop word removal to get 
 reasonable cluster labels), it would be good if Carrot2 used both its own and 
 Solr's stop words.
 I'm not sure what the best way to implement this would be though. My first 
 thought was to simply load {{stopwords.txt}} from Solr config dir and merge 
 them with Carrot2's. But then, maybe a better approach would be to get the 
 stop words from the StopFilter being used? Ideally, we should also consider 
 the per-field stop filters configured on the fields used for clustering.
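
The first idea above (load {{stopwords.txt}} and merge it with Carrot2's list) could look roughly like the sketch below. It assumes both sources are plain word lists with one entry per line and {{#}} marking comment lines; the class and method names are made up for illustration, and reading stop words from the configured StopFilter instead is not covered.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class StopWordMerger {

    /** Returns the union of both word lists, lower-cased, without blanks or comments. */
    static Set<String> merge(List<String> solrLines, List<String> carrot2Lines) {
        Set<String> merged = new TreeSet<>();
        addWords(merged, solrLines);
        addWords(merged, carrot2Lines);
        return merged;
    }

    private static void addWords(Set<String> target, List<String> lines) {
        for (String line : lines) {
            String word = line.trim();
            // Assumption: empty lines and lines starting with '#' carry no stop word.
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            target.add(word.toLowerCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) {
        List<String> solr = Arrays.asList("# Solr stopwords.txt", "the", "of");
        List<String> carrot2 = Arrays.asList("of", "click", "home");
        System.out.println(merge(solr, carrot2));
    }
}
```

Taking the union keeps Carrot2's more aggressive list intact while still honouring anything the user added on the Solr side.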




[jira] [Resolved] (SOLR-2449) Loading of Carrot2 resources from Solr config directory

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2449.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Loading of Carrot2 resources from Solr config directory
 ---

 Key: SOLR-2449
 URL: https://issues.apache.org/jira/browse/SOLR-2449
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
 Fix For: 3.2, 4.0

 Attachments: SOLR-2449.patch


 Currently, Carrot2 clustering algorithms read linguistic resources (stop 
 words, stop labels) from the classpath (Carrot2 JAR), which makes them 
 difficult to edit/override. The directory from which Carrot2 should read its 
 resources (absolute, or relative to Solr config dir) could be specified in 
 the {{engine}} element. By default, the path could be e.g. 
 {{solr.conf/clustering/carrot2}}.
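
A minimal sketch of the path handling described above, assuming the configured value arrives as a plain string: absolute paths are used as given, relative ones are resolved against the Solr config directory, and a missing value falls back to {{clustering/carrot2}}. The class name and method signature are hypothetical, and the default reflects one reading of the proposal rather than the committed behaviour.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ResourceDirResolver {

    /** Assumed default, relative to the Solr config directory. */
    static final String DEFAULT_DIR = "clustering/carrot2";

    /** Resolves the configured directory; absolute values bypass the config dir. */
    static Path resolve(Path configDir, String configured) {
        String value = (configured == null || configured.trim().isEmpty())
                ? DEFAULT_DIR
                : configured.trim();
        Path candidate = Paths.get(value);
        Path result = candidate.isAbsolute() ? candidate : configDir.resolve(candidate);
        return result.normalize();
    }

    public static void main(String[] args) {
        Path configDir = Paths.get("solr", "conf");
        System.out.println(resolve(configDir, null));
        System.out.println(resolve(configDir, "my-lexical-resources"));
    }
}
```

Checking {{isAbsolute()}} explicitly makes the two cases visible in the code, which matters because mishandling absolute directories is an easy mistake in this kind of lookup.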




[jira] [Resolved] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2448.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, 
 SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Resolved] (SOLR-2505) Output cluster scores

2011-05-16 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-2505.
-

Resolution: Fixed

Committed to trunk and branch_3x.

 Output cluster scores
 -

 Key: SOLR-2505
 URL: https://issues.apache.org/jira/browse/SOLR-2505
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0


 Carrot2 algorithms compute cluster scores; we could expose them on the output 
 from Solr clustering component. Along with scores, we can output a boolean 
 flag that marks the Other Topics groups.




[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-11 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2448:


Attachment: (was: SOLR-2448-2449-2450-2505-trunk.zip)

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, 
 SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-11 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2448:


Attachment: carrot2-core-3.5.0.jar
SOLR-2448-2449-2450-2505-trunk.patch
SOLR-2448-2449-2450-2505-branch_3x.patch

Hi, here's another set of patches (svn this time) against trunk and branch_3x. 
I've corrected Maven configs and checked that the project builds fine using mvn 
install.

After applying the patches you'd need to manually update the JARs:

In trunk, delete:

trunk/solr/contrib/clustering/lib/carrot2-core-3.4.2.jar
trunk/solr/contrib/clustering/lib/hppc-0.3.1.jar

and replace them with new versions:

http://repo1.maven.org/maven2/org/carrot2/carrot2-core/3.5.0/carrot2-core-3.5.0.jar
http://repo1.maven.org/maven2/com/carrotsearch/hppc/0.3.3/hppc-0.3.3.jar


In branch_3x, delete:

branch_3x/solr/contrib/clustering/lib/carrot2-core-3.4.2.jar
branch_3x/solr/contrib/clustering/lib/hppc-0.3.1.jar

and replace them with new versions:

carrot2-core-3.5.0.jar attached (jdk15 backport)
http://repo1.maven.org/maven2/com/carrotsearch/hppc/0.3.4/hppc-0.3.4-jdk15.jar


It'd be great if someone could review these before I make the commit.

Thanks!

S.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-branch_3x.patch, 
 SOLR-2448-2449-2450-2505-trunk.patch, carrot2-core-3.5.0.jar


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-10 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2448:


Attachment: SOLR-2448-2449-2450-2505-trunk.zip

Hi, we've finally [released Carrot2 
3.5.0|http://project.carrot2.org/release-3.5.0], so I'm attaching the patch 
(git) against Solr trunk for your review. The patch contains several separate 
commits related to the upgrade (SOLR-2448, SOLR-2449, SOLR-2450, SOLR-2505); I 
hope it will be easier to review this way.

One thing I'm wondering about is Maven artifact generation that seems to be 
gone from trunk contribs (compared to the 3.x branch). Let me know if I need to 
update the dependencies/version numbers anywhere.

The patch for Solr 3.x is in the works; we need to release a JDK 1.5-compatible 
version of some of the dependencies (HPPC) to make it happen.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-trunk.zip, SOLR-2448.zip


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-10 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2448:


Attachment: (was: SOLR-2448.zip)

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-trunk.zip


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-10 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2448:


Attachment: SOLR-2448-2449-2450-2505-svn.patch

Hi Steven,

Thanks for your help and apologies for the git confusion; here's the SVN patch. 
After patching, you'd also need to delete:

trunk/solr/contrib/clustering/lib/carrot2-core-3.4.2.jar
trunk/solr/contrib/clustering/lib/hppc-0.3.1.jar

and replace them with new versions:

http://repo1.maven.org/maven2/org/carrot2/carrot2-core/3.5.0/carrot2-core-3.5.0.jar
http://repo1.maven.org/maven2/com/carrotsearch/hppc/0.3.3/hppc-0.3.3.jar



 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-svn.patch, 
 SOLR-2448-2449-2450-2505-trunk.zip


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-10 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031186#comment-13031186
 ] 

Stanislaw Osinski commented on SOLR-2448:
-

bq. So, I first tried running ant generate-maven-artifacts from solr/ on trunk, 
without applying your patches, and all artifacts, including contribs, are 
generated under solr/package/maven/. Are you using a different Ant target for 
Maven artifact generation?

The target runs fine for me too (on the patched code). I just wanted to update 
the version number of the Carrot2 dependency, but couldn't find any file 
referencing the old number (3.4.2). Now I see that the generated 
solr-clustering POM XML has carrot2-core as a dependency, but does not specify 
the exact version number. I guess there's some more Maven magic I need to learn 
to understand this :-)

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-svn.patch, 
 SOLR-2448-2449-2450-2505-trunk.zip


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-05-10 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13031197#comment-13031197
 ] 

Stanislaw Osinski commented on SOLR-2448:
-

bq. Versions for all dependencies for both Solr and Lucene are specified in one 
place: the grandparent POM, in the root of the sources.

Everything is clear then, thanks! I'll update the version number and remove 
the Carrot2 Maven repository; the latest Carrot2 binaries are now available from 
Maven Central.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448-2449-2450-2505-svn.patch, 
 SOLR-2448-2449-2450-2505-trunk.zip


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Created] (SOLR-2505) Output cluster scores

2011-05-09 Thread Stanislaw Osinski (JIRA)
Output cluster scores
-

 Key: SOLR-2505
 URL: https://issues.apache.org/jira/browse/SOLR-2505
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0


Carrot2 algorithms compute cluster scores; we could expose them on the output 
from Solr clustering component. Along with scores, we can output a boolean flag 
that marks the Other Topics groups.
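
A minimal sketch of what such output could look like, assuming each cluster carries a label, a score and an "Other Topics" marker. The `ScoredCluster` type and the `score` / `other-topics` key names are hypothetical placeholders for illustration, not the actual Carrot2 classes or the final Solr response format:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a cluster produced by a clustering algorithm. */
final class ScoredCluster {
    final String label;
    final double score;
    final boolean otherTopics;

    ScoredCluster(String label, double score, boolean otherTopics) {
        this.label = label;
        this.score = score;
        this.otherTopics = otherTopics;
    }
}

final class ClusterScoreOutput {
    /** Converts clusters into ordered key/value entries for a response. */
    static List<Map<String, Object>> toResponse(List<ScoredCluster> clusters) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (ScoredCluster c : clusters) {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("label", c.label);
            entry.put("score", c.score);
            // Emit the flag only for the synthetic "Other Topics" group
            // (assumed key name).
            if (c.otherTopics) {
                entry.put("other-topics", Boolean.TRUE);
            }
            out.add(entry);
        }
        return out;
    }
}
```

Emitting the flag only when it is set keeps the output for regular clusters unchanged apart from the added score; whether that is preferable to always emitting it is a design choice left open here.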




[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-04-02 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2448:


Attachment: SOLR-2448.zip

Initial patch (git) based on Carrot2 3.5.0-dev, against Solr trunk. As soon as 
we make the stable 3.5.0 release, I'll submit the final patch for your review.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2448.zip


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Updated] (SOLR-2449) Loading of Carrot2 resources from Solr config directory

2011-04-02 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2449:


Attachment: SOLR-2449.patch

The patch requires the SOLR-2448 patch to be applied.

 Loading of Carrot2 resources from Solr config directory
 ---

 Key: SOLR-2449
 URL: https://issues.apache.org/jira/browse/SOLR-2449
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
 Fix For: 3.2, 4.0

 Attachments: SOLR-2449.patch


 Currently, Carrot2 clustering algorithms read linguistic resources (stop 
 words, stop labels) from the classpath (Carrot2 JAR), which makes them 
 difficult to edit/override. The directory from which Carrot2 should read its 
 resources (absolute, or relative to Solr config dir) could be specified in 
 the {{engine}} element. By default, the path could be e.g. 
 {{solr.conf/clustering/carrot2}}.




[jira] [Updated] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words

2011-04-02 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2450:


Attachment: SOLR-2450.patch

Patch for the use of stop words from the field's {{StopWordFilterFactory}} and 
{{CommonGramsFilterFactory}} in addition to Carrot2's built-in stop words.

Requires the SOLR-2448 and SOLR-2449 patches to be applied.

 Carrot2 clustering should use both its own and Solr's stop words
 

 Key: SOLR-2450
 URL: https://issues.apache.org/jira/browse/SOLR-2450
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: SOLR-2450.patch


 While using only Solr's stop words for clustering isn't a good idea (compared 
 to indexing, clustering needs more aggressive stop word removal to get 
 reasonable cluster labels), it would be good if Carrot2 used both its own and 
 Solr's stop words.
 I'm not sure what the best way to implement this would be though. My first 
 thought was to simply load {{stopwords.txt}} from Solr config dir and merge 
 them with Carrot2's. But then, maybe a better approach would be to get the 
 stop words from the StopFilter being used? Ideally, we should also consider 
 the per-field stop filters configured on the fields used for clustering.




[jira] [Commented] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-03-30 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013547#comment-13013547
 ] 

Stanislaw Osinski commented on SOLR-2448:
-

Oh, is there any way to assign this issue to myself? It looks like I don't have 
this permission now.

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Created] (SOLR-2449) Loading of Carrot2 resources from Solr config directory

2011-03-30 Thread Stanislaw Osinski (JIRA)
Loading of Carrot2 resources from Solr config directory
---

 Key: SOLR-2449
 URL: https://issues.apache.org/jira/browse/SOLR-2449
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 3.2, 4.0


Currently, Carrot2 clustering algorithms read linguistic resources (stop words, 
stop labels) from the classpath (Carrot2 JAR), which makes them difficult to 
edit/override. The directory from which Carrot2 should read its resources 
(absolute, or relative to Solr config dir) could be specified in the {{engine}} 
element. By default, the path could be e.g. {{solr.conf/clustering/carrot2}}.
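
A rough sketch of the lookup order this implies: try the configured directory first, then fall back to the classpath. The `FallbackResourceLookup` class is hypothetical and uses only the JDK; it is not the Solr or Carrot2 resource-loading API, and the actual patch may be structured differently:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class FallbackResourceLookup {
    private final Path resourceDir;      // e.g. a directory under the Solr config dir
    private final ClassLoader fallback;  // e.g. the loader that sees the Carrot2 JAR

    FallbackResourceLookup(Path resourceDir, ClassLoader fallback) {
        this.resourceDir = resourceDir;
        this.fallback = fallback;
    }

    /**
     * Opens the named resource, preferring an editable file in the
     * directory over the built-in copy on the classpath.
     * Returns null if the resource is found in neither place.
     */
    InputStream open(String name) throws IOException {
        Path candidate = resourceDir.resolve(name);
        if (Files.isRegularFile(candidate)) {
            return Files.newInputStream(candidate);
        }
        return fallback.getResourceAsStream(name);
    }
}
```

With this order, dropping an edited copy of a stop word file into the directory overrides the bundled one, while resources that are not overridden keep working unchanged.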




[jira] [Created] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words

2011-03-30 Thread Stanislaw Osinski (JIRA)
Carrot2 clustering should use both its own and Solr's stop words


 Key: SOLR-2450
 URL: https://issues.apache.org/jira/browse/SOLR-2450
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0


While using only Solr's stop words for clustering isn't a good idea (compared 
to indexing, clustering needs more aggressive stop word removal to get 
reasonable cluster labels), it would be good if Carrot2 used both its own and 
Solr's stop words.

I'm not sure what the best way to implement this would be though. My first 
thought was to simply load {{stopwords.txt}} from Solr config dir and merge 
them with Carrot2's. But then, maybe a better approach would be to get the stop 
words from the StopFilter being used? Ideally, we should also consider the 
per-field stop filters configured on the fields used for clustering.
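
To illustrate the first idea (merging the two word lists), here is a small sketch. It assumes both lists are already available as collections of lines, that lines starting with `#` are comments, and that lowercasing is an acceptable normalization. The `StopWordMerger` class is hypothetical and does not touch Solr's analysis chain or Carrot2's lexical resources:

```java
import java.util.Collection;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class StopWordMerger {
    /** Returns the union of both lists, trimmed, lowercased and de-duplicated. */
    static Set<String> merge(Collection<String> carrot2Words,
                             Collection<String> solrWords) {
        Set<String> merged = new TreeSet<>();
        addNormalized(merged, carrot2Words);
        addNormalized(merged, solrWords);
        return merged;
    }

    private static void addNormalized(Set<String> target, Collection<String> lines) {
        for (String line : lines) {
            String word = line.trim();
            // Skip blank lines and comment lines (assumed '#' prefix).
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            target.add(word.toLowerCase(Locale.ROOT));
        }
    }
}
```

Taking the words from the per-field stop filters instead would change only where `solrWords` comes from; the union step would stay the same.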




[jira] [Assigned] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words

2011-03-30 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski reassigned SOLR-2450:
---

Assignee: Stanislaw Osinski

 Carrot2 clustering should use both its own and Solr's stop words
 

 Key: SOLR-2450
 URL: https://issues.apache.org/jira/browse/SOLR-2450
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0


 While using only Solr's stop words for clustering isn't a good idea (compared 
 to indexing, clustering needs more aggressive stop word removal to get 
 reasonable cluster labels), it would be good if Carrot2 used both its own and 
 Solr's stop words.
 I'm not sure what the best way to implement this would be though. My first 
 thought was to simply load {{stopwords.txt}} from Solr config dir and merge 
 them with Carrot2's. But then, maybe a better approach would be to get the 
 stop words from the StopFilter being used? Ideally, we should also consider 
 the per-field stop filters configured on the fields used for clustering.




[jira] [Updated] (SOLR-2448) Upgrade Carrot2 to version 3.5.0

2011-03-30 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2448:


Assignee: Stanislaw Osinski

Yes, thanks!

 Upgrade Carrot2 to version 3.5.0
 

 Key: SOLR-2448
 URL: https://issues.apache.org/jira/browse/SOLR-2448
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Priority: Minor
 Fix For: 3.2, 4.0


 Carrot2 version 3.5.0 should be available very soon. After the upgrade, it 
 will be possible to implement a few improvements to the clustering plugin; 
 I'll file separate issues for these.




[jira] [Commented] (SOLR-2449) Loading of Carrot2 resources from Solr config directory

2011-03-30 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013614#comment-13013614
 ] 

Stanislaw Osinski commented on SOLR-2449:
-

This is exactly how I implemented it. I'll attach a patch for review when we 
release and integrate Carrot2 3.5.0 (required for this improvement to work).

A more interesting case though is SOLR-2450 -- any hints about the recommended 
way to get hold of Solr's own stop words?

 Loading of Carrot2 resources from Solr config directory
 ---

 Key: SOLR-2449
 URL: https://issues.apache.org/jira/browse/SOLR-2449
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
 Fix For: 3.2, 4.0


 Currently, Carrot2 clustering algorithms read linguistic resources (stop 
 words, stop labels) from the classpath (Carrot2 JAR), which makes them 
 difficult to edit/override. The directory from which Carrot2 should read its 
 resources (absolute, or relative to Solr config dir) could be specified in 
 the {{engine}} element. By default, the path could be e.g. 
 {{solr.conf/clustering/carrot2}}.




[jira] Updated: (SOLR-2282) Distributed Support for Search Result Clustering

2011-01-15 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2282:


Attachment: SOLR-2282-concurrency-branch_3x.patch
SOLR-2282-concurrency-trunk.patch

Thanks for debugging this, Dawid! I think solution 2) you suggested would be 
the best because it applies both to version 3.4.2 of Carrot2 (currently used by 
Solr) and the 3.5.0 version (not yet released).

I'm attaching patches for Solr trunk and branch_3x that fix the concurrency 
issue and correct a typo in a log message output by 
{{LuceneLanguageModelFactory}}.

 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282-concurrency-branch_3x.patch, 
 SOLR-2282-concurrency-trunk.patch, SOLR-2282-diagnostics.patch, 
 SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282_test.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering

2011-01-13 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981168#action_12981168
 ] 

Stanislaw Osinski commented on SOLR-2282:
-

Hi Robert,

What's the configuration (OS / JVM) on which the test is failing for you? I 
can't get it to fail on my machines (Win 7 64-bit with Sun JVM 1.6.0_20 and 
Oracle 1.6.0_23, Ubuntu 64-bit with Sun JVM 1.6.0_20). I'm running the test 
using the command I found in Hudson logs (ant test 
-Dtestcase=DistributedClusteringComponentTest -Dtestmethod=testDistribSearch 
-Dtests.seed=41204997274180:6405396687385598457 -Dtests.multiplier=3).

S.

 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.




[jira] Issue Comment Edited: (SOLR-2282) Distributed Support for Search Result Clustering

2011-01-13 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981168#action_12981168
 ] 

Stanislaw Osinski edited comment on SOLR-2282 at 1/13/11 3:19 AM:
--

Hi Robert,

What's the configuration (OS / JVM) on which the test is failing for you? I 
can't get it to fail on my machines (Win 7 64-bit with Sun JVM 1.6.0_20 Client 
VM and Oracle 1.6.0_23 Server VM, Ubuntu 64-bit with Sun JVM 1.6.0_20 Server 
VM). I'm running the test using the command I found in Hudson logs (ant test 
-Dtestcase=DistributedClusteringComponentTest -Dtestmethod=testDistribSearch 
-Dtests.seed=41204997274180:6405396687385598457 -Dtests.multiplier=3).

S.

  was (Author: stanislaw.osinski):
Hi Robert,

What's the configuration (OS / JVM) on which the test is failing for you? I 
can't get it to fail on my machines (Win 7 64-bit with Sun JVM 1.6.0_20 and 
Oracle 1.6.0_23, Ubuntu 64-bit with Sun JVM 1.6.0_20). I'm running the test 
using the command I found in Hudson logs (ant test 
-Dtestcase=DistributedClusteringComponentTest -Dtestmethod=testDistribSearch 
-Dtests.seed=41204997274180:6405396687385598457 -Dtests.multiplier=3).

S.
  
 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.




[jira] Updated: (SOLR-2282) Distributed Support for Search Result Clustering

2011-01-13 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2282:


Attachment: SOLR-2282-diagnostics.patch

Robert: I was using the random seed from the build result in the hope that it 
would make the test fail for me. I'm still unable to get the exception though, 
with or without the seed. I suppose it shouldn't matter whether I run the 
complete test suite or just this one test method? (I was doing the latter to 
save time.)

If you have a spare moment, would you be able to check the following two things 
on your machine:

1. Apply the attached diagnostics patch and run the tests. If the test doesn't 
fail after the change, this means there's some concurrency issue in Carrot2's 
internal resource pooling mechanisms that we'll need to find. This patch is not 
a solution to the problem though, just a diagnostic measure.

2. It's paranoid, but can you run the test with the 
{{-Dargs=-XX:+TraceClassLoading}} option and check that there's no old (v3.4.0) 
Carrot2 JAR hiding in the bushes? Version 3.4.0 had a subtle bug that could be 
causing the exception. If there are no traces of the Carrot2 3.4.0 JAR in the 
classpath, we'll need to do further inspection of our code.
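
As a quick complement to check 2, the sketch below scans a classpath string for entries whose file name contains a given fragment. It is only an illustration with a hypothetical `ClasspathJarFinder` class. It inspects the `java.class.path` value alone, so it will not see JARs picked up by custom class loaders, and `-XX:+TraceClassLoading` remains the more reliable check:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ClasspathJarFinder {
    /** Returns the classpath entries whose file name contains the fragment. */
    static List<String> findEntries(String classPath, String fragment) {
        List<String> hits = new ArrayList<>();
        // Entries are separated by ':' on Unix-like systems and ';' on Windows.
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            String fileName = new File(entry).getName();
            if (fileName.contains(fragment)) {
                hits.add(entry);
            }
        }
        return hits;
    }
}
```

For example, `ClasspathJarFinder.findEntries(System.getProperty("java.class.path"), "carrot2")` would list every Carrot2 JAR named on the JVM classpath, which makes a stray older version easy to spot.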

 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282-diagnostics.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282_test.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.




[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering

2011-01-13 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981241#action_12981241
 ] 

Stanislaw Osinski commented on SOLR-2282:
-

{quote}
well, its not completely consistent even with the seed to me (smells like a 
concurrency issue).
{quote}

This is what I've been suspecting from the beginning; I hope Dawid has better 
luck at reproducing the problem on his 4-core HT machine.

{quote}
Silly question, but did you remove the @Ignore on 
DistributedClusteringComponentTest?
Otherwise, the reproducibility problem could be that it doesn't consistently 
fail every time, even with the same seed.
{quote}

Yeah, I did remove the @Ignore; I'm getting "Testsuite: 
org.apache.solr.handler.clustering.DistributedClusteringComponentTest, Tests 
run: 1, Failures: 0, Errors: 0, Time elapsed: 59,658 sec" in the test results 
dir. When it comes to reproducibility, I wasn't able to reproduce some other 
concurrency issue on my 2-core machine, while on Dawid's 4-core hardware the 
tests would fail sometimes, so I hope we can eventually get the exception 
locally.

{quote}
I ran my previous fail three times, with the patch. This failed two out of 
three times.
{quote}

Thanks for verifying this! It looks like the bug may be at some other place in 
the C2 code than I initially thought. Let us review the code once again; as 
soon as we come up with a fix, I'll attach a patch.


 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282-diagnostics.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282_test.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.




[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering

2011-01-12 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980881#action_12980881
 ] 

Stanislaw Osinski commented on SOLR-2282:
-

Sure, I'll take a look at it tomorrow morning. 

 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282.patch, SOLR-2282_test.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.




[jira] Updated: (SOLR-2296) Upgrade Carrot2 binaries to version 3.4.2

2010-12-29 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2296:


Attachment: carrot2-core-3.4.2-jdk1.5.jar

Carrot2 3.4.2 core JAR compiled for JDK 1.5. contrib/clustering compiles fine 
for me, and the clustering tests pass too.

 Upgrade Carrot2 binaries to version 3.4.2
 -

 Key: SOLR-2296
 URL: https://issues.apache.org/jira/browse/SOLR-2296
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Koji Sekiguchi
 Fix For: 3.1, 4.0

 Attachments: carrot2-core-3.4.2-jdk1.5.jar, carrot2-core-3.4.2.jar, 
 SOLR-2296-branch_3.1.patch, SOLR-2296-trunk.patch


 Version 3.4.2 fixes a concurrency bug in Carrot2 that may be causing 
 SOLR-2282. I'll attach patches in a minute.




[jira] Created: (SOLR-2296) Upgrade Carrot2 binaries to version 3.4.2

2010-12-23 Thread Stanislaw Osinski (JIRA)
Upgrade Carrot2 binaries to version 3.4.2
-

 Key: SOLR-2296
 URL: https://issues.apache.org/jira/browse/SOLR-2296
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 3.1, 4.0


Version 3.4.2 fixes a concurrency bug in Carrot2 that may be causing SOLR-2282. 
I'll attach patches in a minute.




[jira] Updated: (SOLR-2296) Upgrade Carrot2 binaries to version 3.4.2

2010-12-23 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-2296:


Attachment: carrot2-core-3.4.2.jar
SOLR-2296-trunk.patch
SOLR-2296-branch_3.1.patch

Patches for trunk and branch_3.1, plus the Carrot2 3.4.2 JAR (BSD License).

 Upgrade Carrot2 binaries to version 3.4.2
 -

 Key: SOLR-2296
 URL: https://issues.apache.org/jira/browse/SOLR-2296
 Project: Solr
  Issue Type: Task
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 3.1, 4.0

 Attachments: carrot2-core-3.4.2.jar, SOLR-2296-branch_3.1.patch, 
 SOLR-2296-trunk.patch


 Version 3.4.2 fixes a concurrency bug in Carrot2 that may be causing 
 SOLR-2282. I'll attach patches in a minute.




[jira] Commented: (SOLR-2282) Distributed Support for Search Result Clustering

2010-12-22 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974314#action_12974314
 ] 

Stanislaw Osinski commented on SOLR-2282:
-

This may be related to a concurrency bug we fixed in the latest (3.4.2) release 
of Carrot2. Tomorrow morning I can prepare a Carrot2 upgrade patch, which 
should hopefully fix the problem.

 Distributed Support for Search Result Clustering
 

 Key: SOLR-2282
 URL: https://issues.apache.org/jira/browse/SOLR-2282
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: SOLR-2282.patch, SOLR-2282.patch, SOLR-2282.patch, 
 SOLR-2282.patch, SOLR-2282.patch


 Brad Giaccio contributed a patch for this in SOLR-769. I'd like to 
 incorporate it.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2010-12-13 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970760#action_12970760
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Koji,

Actually, the current code seems right. If we don't output subclusters, we need 
to include all documents of the cluster, including those from its subclusters; 
otherwise the subclusters' documents may not appear in the response at all. But 
if we do output subclusters, we add only the documents assigned specifically to 
the cluster, because the subclusters with their documents will be included in 
the response too.
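
To make the rule above concrete, here is a minimal Java sketch. It is not the 
actual clustering component code: {{SimpleCluster}}, {{ownDocIds}}, 
{{allDocIds}} and {{docIdsForResponse}} are hypothetical names made up for 
illustration, and documents are represented by plain string ids.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a cluster; this is NOT Carrot2's Cluster class.
final class SimpleCluster {
    final String label;
    // Documents assigned specifically to this cluster.
    final List<String> ownDocIds = new ArrayList<>();
    final List<SimpleCluster> subclusters = new ArrayList<>();

    SimpleCluster(String label) {
        this.label = label;
    }

    // All documents of this cluster, including those of its subclusters,
    // in encounter order and without duplicates.
    Set<String> allDocIds() {
        Set<String> ids = new LinkedHashSet<>(ownDocIds);
        for (SimpleCluster sub : subclusters) {
            ids.addAll(sub.allDocIds());
        }
        return ids;
    }

    // Documents to list under this cluster in the response.
    List<String> docIdsForResponse(boolean outputSubclusters) {
        if (outputSubclusters) {
            // Subclusters are emitted separately and carry their own documents.
            return new ArrayList<>(ownDocIds);
        }
        // Subclusters are hidden, so their documents must be listed here.
        return new ArrayList<>(allDocIds());
    }
}
```

With a parent holding d1 and a child holding d2, the parent lists only d1 when 
subclusters are output, and both d1 and d2 when they are not.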

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.




[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-08-24 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: carrot2-core-3.4.0-jdk1.5.jar

Hi Grant,

Thanks for committing the patches! I noticed that the 3.x branch build failed 
because Carrot2 JAR had classes in Java 1.6 format. I'm attaching a Java 
1.5-compliant JAR. After replacing the original JAR with the attached one, all 
Solr tests passed on Java 1.5 on my machine. Apologies for not checking this 
earlier.

Also, I believe the last paragraph of contrib/clustering/README.txt does not 
hold any more as all JARs are now distributed with Solr.

Staszek

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 3.1, 4.0

 Attachments: carrot2-core-3.4.0-jdk1.5.jar, 
 SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch, 
 SOLR-1804-carrot2-3.4.0-libs.zip, SOLR-1804.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-08-24 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901791#action_12901791
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

One more thing: contrib/clustering in trunk seems to contain some leftovers 
from the time clustering was disabled: build.xml.disabled, DISABLED-README.txt 
and the LGPL-related paragraph in README.txt. I guess we could remove them too 
to avoid confusion.

S.

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 3.1, 4.0

 Attachments: carrot2-core-3.4.0-jdk1.5.jar, 
 SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch, 
 SOLR-1804-carrot2-3.4.0-libs.zip, SOLR-1804.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-08-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: (was: SOLR-1804-carrot2-3.4.0-dev-libs.zip)

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev-trunk.patch, 
 SOLR-1804-carrot2-3.4.0-dev.patch, SOLR-1804.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-08-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: SOLR-1804-carrot2-3.4.0-libs.zip

Here are the libs with Carrot2 3.4.0 JAR.

1. Apply the patch (the patch hasn't changed)
2. Copy the libs from the ZIP overwriting the old ones
3. Remove Google collections from solr/lib (it's replaced by Guava from the 
ZIP). If you don't do that, tests will fail due to class path conflicts.

I've just tested this on my machine with the latest branch_3x (r966551) and all 
tests pass. If some tests fail for you, let me know and I'll investigate.

S.

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev-trunk.patch, 
 SOLR-1804-carrot2-3.4.0-dev.patch, SOLR-1804-carrot2-3.4.0-libs.zip, 
 SOLR-1804.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-28 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: SOLR-1804-carrot2-3.4.0-dev-trunk.patch

A patch against solr trunk, the libs are the same as for the branch_3x patch.

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
 SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-22 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: SOLR-1804-carrot2-3.4.0-dev.patch

Ok, here's another shot. This time, the language model factory includes support 
for Chinese. To avoid compilation issues, the classes are loaded through 
reflection. Not pretty, but it works. If there's a way to have access to smart 
chinese at compilation time, let me know and I can remove the reflection code, 
which would make refactoring more reliable.

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: SOLR-1804-carrot2-3.4.0-dev.patch

Hi,

As we're near the 3.4.0 release of Carrot2, I'm including a patch that upgrades 
the clustering plugin. The most notable changes are:

* [3.4.0] Carrot2 core no longer depends on Lucene APIs, so the {{build.xml}} 
can be enabled again. The only class that makes use of Lucene API, 
{{LuceneLanguageModelFactory}}, is now included in the plugin's code, so there 
shouldn't be any problems with refactoring. In fact, I've already updated 
{{LuceneLanguageModelFactory}} to remove the use of deprecated APIs.
* [3.3.0] The STC algorithm has seen some [significant scalability 
improvements|http://project.carrot2.org/release-3.3.0-notes.html]
* [3.2.0] Carrot2 core no longer depends on LGPL libraries, so all the JARs can 
now be included in Solr SVN and SOLR-2007 won't need fixing.

Included is a patch against r966211. A ZIP with JARs will follow in a sec.

A couple of notes:

* The upgrade requires upgrading Google collections to Guava. This is a drop-in 
replacement, all tests pass for me after the upgrade, plus the upgrade is 
[recommended|http://code.google.com/p/google-collections/] on the original 
Google Collections site.
* The patch includes Carrot2 3.4.0-dev JAR, but I guess it's worth committing 
already to avoid the library downloads hassle (SOLR-2007).
* Originally, Carrot2 supports clustering of Chinese content based on the Smart 
Chinese Tokenizer. This tokenizer would have to be referenced from the 
{{LuceneLanguageModelFactory}} class in Solr. However, when compiling the code 
with Ant, smartcn doesn't seem to be available on the classpath. Is it a matter 
of modifying the build files, or is it a policy on dependencies between plugins?

Let me know if you have any problems applying the patch.

Thanks!

S.


 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: SOLR-1804-carrot2-3.4.0-dev-libs.zip

Libs required for the Carrot2 3.4.0 update.

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
 SOLR-1804-carrot2-3.4.0-dev.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890757#action_12890757
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

{quote}
Hi Stanislaw: this looks cool! So, carrot2 jars don't depend directly on 
Lucene, and we can re-enable this component in trunk, and simply maintain the 
LuceneLanguageModelFactory? 
{quote}

Correct. The only dependency on Lucene is {{LuceneLanguageModelFactory}}, which 
is now part of Solr code base. In fact, we could also try bringing back the 
clustering plugin to Solr trunk, though I haven't tried that yet.

{quote}
As far as the smart chinese, its currently not included with Solr, so I think 
this is why you have trouble. But could we enable a carrot2 factory for it that 
reflects it, in case the user puts the jar in the classpath?
{quote}

Essentially, the dependency on the smart chinese is optional in the sense that 
the lack of it will degrade the quality of clustering in Chinese, but will not 
break it. Let me see if I can make it optionally loadable in 
{{LuceneLanguageModelFactory}} too. If not, we'll have to live with degraded 
clustering quality for Chinese.

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
 SOLR-1804-carrot2-3.4.0-dev.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890848#action_12890848
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

{quote}
Essentially, the dependency on the smart chinese is optional in the sense that 
the lack of it will degrade the quality of clustering in Chinese, but will not 
break it. Let me see if I can make it optionally loadable in 
LuceneLanguageModelFactory too.
{quote}

I think we could handle this the same way Carrot2 does: attempt to load the 
Chinese tokenizer and fall back to the default one in case of class loading 
exceptions. The easiest implementation route would be to include smart chinese 
as a dependency during compilation of the clustering plugin, with the 
understanding that the library may or may not be available at runtime. Is 
that possible with the current Solr compilation scripts?
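
For illustration, the load-and-fall-back idea could look roughly like the Java 
sketch below. This is only an assumption about the general shape, not code from 
the patch or from Carrot2: {{OptionalClassLoading}} and 
{{newInstanceOrFallback}} are hypothetical names, and the class names passed in 
stand for the optional tokenizer and the default one.

```java
// Sketch: instantiate an optional class by name, falling back to a default
// class when the optional one is not available on the classpath.
final class OptionalClassLoading {

    static Object newInstanceOrFallback(String preferredClassName,
                                        String fallbackClassName) {
        try {
            return instantiate(preferredClassName);
        } catch (ReflectiveOperationException | LinkageError e) {
            // The optional JAR (or one of its dependencies) is missing or
            // unusable, so use the default implementation instead.
            try {
                return instantiate(fallbackClassName);
            } catch (ReflectiveOperationException e2) {
                throw new IllegalStateException(
                    "Fallback class unavailable: " + fallbackClassName, e2);
            }
        }
    }

    // Loads the class by name and calls its no-argument constructor.
    private static Object instantiate(String className)
            throws ReflectiveOperationException {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }
}
```

Catching {{LinkageError}} as well as {{ClassNotFoundException}} covers the case 
where the class itself is present but something it depends on is not.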

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
 SOLR-1804-carrot2-3.4.0-dev.patch


 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Commented: (SOLR-2007) ant get-libraries tries to re-compile solr

2010-07-20 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890245#action_12890245
 ] 

Stanislaw Osinski commented on SOLR-2007:
-

Hi,

I'm working on upgrading Solr to the latest release of Carrot2, which has only 
ASL and BSD dependencies, so all the libraries should be fit for inclusion on 
the SVN. As soon as I have a working patch, I'll attach it to SOLR-1804.

Staszek

 ant get-libraries tries to re-compile solr
 

 Key: SOLR-2007
 URL: https://issues.apache.org/jira/browse/SOLR-2007
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Affects Versions: 1.4, 1.4.1
Reporter: Hoss Man
 Fix For: 3.1, 4.0


 as noted on solr-user, if someone downloads a solr distribution and tries to 
 follow the steps for using clustering, the ant get-libraries target of 
 contrib/clustering attempts to recompile all of solr.
 this seems to be because get-libraries depends on init
 this really needs to be fixed on both the 3.1 and 4.0 branches before we do 
 any releases.




[jira] Commented: (LUCENE-2484) Remove deprecated TermAttribute from tokenattributes and legacy support in indexer

2010-06-01 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874008#action_12874008
 ] 

Stanislaw Osinski commented on LUCENE-2484:
---

Hi!

Against which version of Lucene should we refactor/build Carrot2 to fix the 
issue? Does it have to be trunk?

Thanks!

S.

 Remove deprecated TermAttribute from tokenattributes and legacy support in 
 indexer
 --

 Key: LUCENE-2484
 URL: https://issues.apache.org/jira/browse/LUCENE-2484
 Project: Lucene - Java
  Issue Type: Task
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-2484.patch


 The title says it:
 - Remove interface TermAttribute
 - Remove empty fake implementation TermAttributeImpl extends 
 CharTermAttributeImpl
 - Remove methods from CharTermAttributeImpl (and indirect from Token)
 - Remove sophisticated® backwards™ Layer in TermsHash*
 - Remove IAE from NumericTokenStream, if TA is available in AS
 - Fix rest of core tests (TestToken)




[jira] Commented: (LUCENE-2484) Remove deprecated TermAttribute from tokenattributes and legacy support in indexer

2010-06-01 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874024#action_12874024
 ] 

Stanislaw Osinski commented on LUCENE-2484:
---

{quote}
Since this clustering contrib depends on binary files that are tied to specific 
versions of the Lucene API,
I suggest the following:
* only enable clustering in release branches (such as 3x)
* when we cut a new release branch from trunk (say we make a 4x), then add the 
new version there that works with it.
* but never have this enabled in trunk, as it is a cyclic dependency
{quote}

Sounds very good to me, thanks for the explanation!


 Remove deprecated TermAttribute from tokenattributes and legacy support in 
 indexer
 --

 Key: LUCENE-2484
 URL: https://issues.apache.org/jira/browse/LUCENE-2484
 Project: Lucene - Java
  Issue Type: Task
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.0

 Attachments: LUCENE-2484.patch


 The title says it:
 - Remove interface TermAttribute
 - Remove empty fake implementation TermAttributeImpl extends 
 CharTermAttributeImpl
 - Remove methods from CharTermAttributeImpl (and indirect from Token)
 - Remove sophisticated® backwards™ Layer in TermsHash*
 - Remove IAE from NumericTokenStream, if TA is available in AS
 - Fix rest of core tests (TestToken)




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845441#action_12845441
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Hi Robert,

The Lucene dependency is the only change, right? Or did you also upgrade 
Carrot2, e.g. from 3.1 to 3.2? If the latter is the case, the number of clusters 
may have changed, e.g. because we tuned stop words or other algorithm attributes.

S.



 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845459#action_12845459
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

I was about to offer advice similar to Grant's, but wanted to wait to confirm 
the scope of changes.

If it was only a Lucene dependency update, and assuming the update didn't 
change the documents fed to Carrot2 in tests, the results shouldn't change. 
Carrot2 uses Lucene interfaces internally, but the tokenizer is not the 
standard Lucene one, so there are no Version.LUCENE_* issues as far as I can tell.

I don't have the Solr code handy, but maybe the test performs clustering on 
summaries generated from the original test documents, and Lucene 3.x introduces 
some changes in the way summaries are generated?

If the clusters look reasonable, the problem is probably not critical, but it 
is still worth investigating to make sure it's not a bug of some kind.

S.


 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-03-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845462#action_12845462
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

Yeah, the clusters look good. When you're done with upgrading Lucene to 3.x, we 
could also upgrade Carrot2 to version 3.2.0, which is LGPL-free and could be 
distributed together with Solr.

S.

 Upgrade Carrot2 to 3.2.0
 

 Key: SOLR-1804
 URL: https://issues.apache.org/jira/browse/SOLR-1804
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 http://project.carrot2.org/release-3.2.0-notes.html
 Carrot2 is now LGPL free, which means we should be able to bundle the binary!




[jira] Resolved: (SOLR-1809) Carrot2 clustering time logging

2010-03-07 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski resolved SOLR-1809.
-

Resolution: Invalid

Hi Erik! You're right, {{debugQuery}} should be enough for most cases. 
Resolving as invalid.

 Carrot2 clustering time logging
 ---

 Key: SOLR-1809
 URL: https://issues.apache.org/jira/browse/SOLR-1809
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 1.5

 Attachments: SOLR-1809.patch


 It may be useful to log the amount of time Carrot2 spent on clustering. This 
 should be helpful when debugging performance issues.




[jira] Created: (SOLR-1809) Carrot2 clustering time logging

2010-03-05 Thread Stanislaw Osinski (JIRA)
Carrot2 clustering time logging
---

 Key: SOLR-1809
 URL: https://issues.apache.org/jira/browse/SOLR-1809
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 1.5


It may be useful to log the amount of time Carrot2 spent on clustering. This 
should be helpful when debugging performance issues.
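
As a rough illustration of the idea (not the attached patch), timing a call and 
logging the elapsed time could look like the Java sketch below. {{TimedCall}} 
and {{timed}} are hypothetical names, the sketch uses {{java.util.logging}} only 
to stay self-contained, and the log level is an arbitrary choice.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.logging.Logger;

// Sketch: run a task, log how long it took, and return its result.
final class TimedCall {
    private static final Logger LOG = Logger.getLogger(TimedCall.class.getName());

    static <T> T timed(String taskName, Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            // Logged even if the task throws, which helps when debugging.
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            LOG.info(taskName + " took " + elapsedMs + " ms");
        }
    }
}
```

The clustering call would be passed in as the {{task}} argument, so the timing 
logic stays separate from the clustering code.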




[jira] Updated: (SOLR-1809) Carrot2 clustering time logging

2010-03-05 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1809:


Attachment: SOLR-1809.patch

An initial patch. I'm not sure what Solr's logging policies are, so feel free to 
change the level as appropriate.

 Carrot2 clustering time logging
 ---

 Key: SOLR-1809
 URL: https://issues.apache.org/jira/browse/SOLR-1809
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Clustering
Reporter: Stanislaw Osinski
 Fix For: 1.5

 Attachments: SOLR-1809.patch


 It may be useful to log the amount of time Carrot2 spent on clustering. This 
 should be helpful when debugging performance issues.




[jira] Commented: (LUCENE-2221) Micro-benchmarks for ntz and pop (BitUtils) operations.

2010-01-26 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805034#action_12805034
 ] 

Stanislaw Osinski commented on LUCENE-2221:
---

I ran the benchmark on 64-bit Linux running on an Intel(R) Xeon(R) E5520 @ 
2.27GHz. I tried both Sun's JDK 1.7-ea and JDK 1.6.0_18, which also has 
support for native {{POPCNT}}.
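
For readers unfamiliar with the two approaches being compared, here is a small 
Java sketch of a software population count next to the JRE method that HotSpot 
can compile to the native instruction. It is not the benchmark code: 
{{PopCount}}, {{popSoftware}} and {{popJre}} are hypothetical names, and the 
bit-twiddling variant is the textbook one, not necessarily what {{BitUtil}} 
uses. {{Long.bitCount}} is the standard JDK method.

```java
// Sketch: two ways to count the set bits of a 64-bit word.
final class PopCount {

    // Pure-software population count using the classic parallel bit-summing trick.
    static int popSoftware(long x) {
        x = x - ((x >>> 1) & 0x5555555555555555L);                          // 2-bit sums
        x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L);  // 4-bit sums
        x = (x + (x >>> 4)) & 0x0F0F0F0F0F0F0F0FL;                          // 8-bit sums
        return (int) ((x * 0x0101010101010101L) >>> 56);                    // add all bytes
    }

    // JRE population count; a JIT with POPCNT support may emit one instruction.
    static int popJre(long x) {
        return Long.bitCount(x);
    }
}
```

Both methods return the same value for every input; only their cost differs, 
which is what the numbers below measure.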

*JDK 1.7-ea {{-server -XX:+UsePopCountInstruction}}*

{noformat}
# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
Benchmark_BitUtil_trunk.test_pop_array        : 15/20 rounds, time.total: 7.69, time.warmup: 1.96, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_xor          : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_intersect    : 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_andnot       : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_pop_union        : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_ntz_iterator_int : 5/7 rounds, time.total: 42.30, time.warmup: 12.02, time.bench: 30.29, round: 6.06 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_trunk.test_ntz_iterator_long: 5/7 rounds, time.total: 55.48, time.warmup: 15.43, time.bench: 40.05, round: 8.01 [+- 0.06], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
Benchmark_BitUtil_pop3264.test_pop_array    : 15/20 rounds, time.total: 7.78, time.warmup: 2.05, time.bench: 5.73, round: 0.38 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_xor      : 15/20 rounds, time.total: 11.13, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_intersect: 15/20 rounds, time.total: 11.14, time.warmup: 2.82, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_andnot   : 15/20 rounds, time.total: 10.46, time.warmup: 2.66, time.bench: 7.80, round: 0.52 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_pop3264.test_pop_union    : 15/20 rounds, time.total: 11.13, time.warmup: 2.81, time.bench: 8.32, round: 0.55 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, Sun Microsystems Inc.,
Benchmark_BitUtil_popNtzJRE.test_pop_array    : 15/20 rounds, time.total: 5.06, time.warmup: 1.29, time.bench: 3.77, round: 0.25 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_xor      : 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_intersect: 15/20 rounds, time.total: 8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_andnot   : 15/20 rounds, time.total: 7.81, time.warmup: 1.99, time.bench: 5.81, round: 0.39 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_pop_union: 15/20 rounds, time.total: 
8.54, time.warmup: 2.15, time.bench: 6.39, round: 0.43 [+- 0.00], round.gc: 
0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_int : 5/7 rounds, time.total: 
33.55, time.warmup: 8.72, time.bench: 24.83, round: 4.97 [+- 0.00], round.gc: 
0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE.test_ntz_iterator_long: 5/7 rounds, time.total: 
39.61, time.warmup: 11.48, time.bench: 28.12, round: 5.62 [+- 0.00], round.gc: 
0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
# 1.7.0-ea-fastdebug, Java HotSpot(TM) 64-Bit Server VM, 17.0-b07-fastdebug, 
Sun Microsystems Inc.,
Benchmark_BitUtil_popNtzJRE_simple.test_pop_array : 15/20 rounds, time.total: 
3.25, time.warmup: 0.82, time.bench: 2.43, round: 0.16 [+- 0.00], round.gc: 
0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
Benchmark_BitUtil_popNtzJRE_simple.test_pop_xor   : 15/20 rounds, time.total: 
5.05, 

[jira] Commented: (SOLR-1692) CarrotClusteringEngine produce summary does nothing

2010-01-02 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795925#action_12795925
 ] 

Stanislaw Osinski commented on SOLR-1692:
-

{quote}
bq. Where should the configuration of the highlighter we use for clustering 
come from?

We have all the code hooked in for it already, we're just ignoring the output.
{quote}

To avoid confusion and questions along the lines of "why don't the clusters 
match the (highlighted) documents I'm seeing?", I'd suggest a slightly more 
elaborate scenario for the clustering highlighter configuration:

1. If main Solr highlighting is disabled, use the clustering component's 
highlighter settings.
2. If main Solr highlighting is enabled, use the main highlighter's 
configuration as the defaults and let the clustering-specific highlighter 
configuration override the defaults.

If we do it this way, we'll minimize the chances of users accidentally 
clustering documents that are highlighted differently from those they will 
see.
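
The two rules above could be sketched roughly as follows. This is only an illustration under assumptions: the configuration is modelled as plain string maps, and the class and method names are hypothetical rather than part of the clustering component.

```java
import java.util.HashMap;
import java.util.Map;

final class ClusteringHighlightConfigSketch {
    /**
     * Rule 1: main highlighting disabled, so use only the clustering settings.
     * Rule 2: main highlighting enabled, so start from the main settings and
     * let the clustering-specific settings override them.
     */
    static Map<String, String> effectiveConfig(boolean mainHighlightingEnabled,
                                               Map<String, String> mainConfig,
                                               Map<String, String> clusteringConfig) {
        Map<String, String> effective = new HashMap<>();
        if (mainHighlightingEnabled) {
            effective.putAll(mainConfig);      // defaults from the main highlighter
        }
        effective.putAll(clusteringConfig);    // clustering-specific overrides
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> mainHl = new HashMap<>();
        mainHl.put("hl.fragsize", "100");
        mainHl.put("hl.snippets", "1");
        Map<String, String> clusteringHl = new HashMap<>();
        clusteringHl.put("hl.fragsize", "200");

        System.out.println(effectiveConfig(true, mainHl, clusteringHl));
        System.out.println(effectiveConfig(false, mainHl, clusteringHl));
    }
}
```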

bq. Would be great if Carrot2 could also just use the analysis that 
Lucene/Solr produces, that way it would be much easier to configure stopwords, 
HTML stripping, etc.

This one would require some larger changes to Carrot2 internals. We do use 
Lucene infrastructure for preprocessing (currently for tokenization), but I can 
investigate if we can extend that further. A potential problem here is that 
very often the set of stopwords you use for document retrieval may not work 
equally well for clustering. I've filed a [Carrot2-specific 
issue|http://issues.carrot2.org/browse/CARROT-606] for it and will try to come 
up with something.

 CarrotClusteringEngine produce summary does nothing
 ---

 Key: SOLR-1692
 URL: https://issues.apache.org/jira/browse/SOLR-1692
 Project: Solr
  Issue Type: Bug
  Components: contrib - Clustering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 1.5

 Attachments: SOLR-1692.patch


 In the CarrotClusteringEngine, the produceSummary option does nothing, as the 
 results of doing the highlighting are just ignored.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-236) Field collapsing

2009-12-29 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795067#action_12795067
 ] 

Stanislaw Osinski commented on SOLR-236:


Hi Grant,

{quote}
I would note, in looking at the Carrot2 code, they actually have a 
ByFieldClusteringAlgorithm (what they call synthetic clustering) which does 
field collapsing/clustering on a value of a field. To quote the javadocs:

Clusters documents into a flat structure based on the values of some field of 
the documents. By default the {@link Document#SOURCES} field is used and  
Name of the field to cluster by. Each non-null scalar field value with distinct 
hash code will give rise to a single cluster, named using the {@link 
Object#toString()} value of the field. If the field value is a collection, the 
document will be assigned to all clusters corresponding to the values in the 
collection. Note that arrays will not be 'unfolded' in this way.

I don't know how it performs, but it seems like it would at least be worth 
investigating.
{quote}

Carrot2's {{ByFieldClusteringAlgorithm}} is very simple. It literally throws 
everything into a hash map based on the field value ([source 
code|http://fisheye3.atlassian.com/browse/carrot2/trunk/core/carrot2-algorithm-synthetic/src/org/carrot2/clustering/synthetic/ByFieldClusteringAlgorithm.java?r=trunk#l99]).
 This algorithm is used in our live demo to [cluster by news 
source|http://search.carrot2.org/stable/search?source=boss-news&query=iphone&algorithm=source].
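
A minimal sketch of the hash-map idea, following the quoted javadoc (scalar values give one cluster each, collection values assign the document to several clusters). It is not the Carrot2 implementation: documents are modelled as plain maps, clusters as lists of document indices, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ByFieldGroupingSketch {
    /** Groups document indices by the value of the given field. */
    static Map<Object, List<Integer>> clusterByField(List<Map<String, Object>> docs,
                                                     String field) {
        Map<Object, List<Integer>> clusters = new LinkedHashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            Object value = docs.get(i).get(field);
            if (value == null) {
                continue;                              // documents without the field are skipped
            }
            if (value instanceof Collection) {
                for (Object v : (Collection<?>) value) {
                    if (v != null) {
                        add(clusters, v, i);           // one cluster per collection element
                    }
                }
            } else {
                add(clusters, value, i);               // scalar value: a single cluster
            }
        }
        return clusters;
    }

    private static void add(Map<Object, List<Integer>> clusters, Object key, int docIndex) {
        clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(docIndex);
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        Map<String, Object> d0 = new LinkedHashMap<>();
        d0.put("source", "news-a");
        Map<String, Object> d1 = new LinkedHashMap<>();
        d1.put("source", "news-b");
        Map<String, Object> d2 = new LinkedHashMap<>();
        d2.put("source", "news-a");
        docs.add(d0);
        docs.add(d1);
        docs.add(d2);
        // The cluster label would be key.toString(), as in the quoted javadoc.
        System.out.println(clusterByField(docs, "source"));
    }
}
```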

{quote}
Note, they also have a synthetic one for collapsing based on URL: 
ByUrlClusteringAlgorithm
{quote}

This one creates a [hierarchy based on the URL 
segments|http://search.carrot2.org/stable/search?source=boss-web&query=solr&algorithm=url&results=200]
 and might be useful to create by-domain collapsing if needed.
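
As a rough illustration of the by-domain case only (a flat grouping by host, not the URL-segment hierarchy that {{ByUrlClusteringAlgorithm}} builds), something along these lines could work. The class and method names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class ByDomainGroupingSketch {
    /** Groups URLs by lower-cased host name; unparsable or host-less URLs share one bucket. */
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new TreeMap<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();   // null when the URL has no host part
            } catch (IllegalArgumentException e) {
                host = null;                        // malformed URL
            }
            String key = (host == null) ? "(no host)" : host.toLowerCase(Locale.ROOT);
            byHost.computeIfAbsent(key, k -> new ArrayList<>()).add(url);
        }
        return byHost;
    }

    public static void main(String[] args) {
        System.out.println(groupByHost(Arrays.asList(
            "http://lucene.apache.org/solr/",
            "http://lucene.apache.org/java/docs/",
            "http://www.carrot2.org/")));
    }
}
```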

In general, my rough guess is that the criteria for content-based 
collapsing would be closer to duplicate detection than to the type of 
grouping Carrot2 produces.

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: 1.5

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, 
 field-collapse-4-with-solrj.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 quasidistributed.additional.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, 
 SOLR-236_collapsing.patch


 This patch includes a new feature called Field collapsing.
 It is used to collapse a group of results with a similar value for a given 
 field into a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also Duplicate detection.
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)
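
One possible reading of the adjacent collapse type with collapse.max, sketched under assumptions: results are represented only by their collapse-field values, and at most max consecutive results with the same value are kept. The actual patch may define the semantics differently; the names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class AdjacentCollapseSketch {
    /** Returns the indices of results kept when runs of equal values are cut to max entries. */
    static List<Integer> collapseAdjacent(List<String> fieldValues, int max) {
        List<Integer> kept = new ArrayList<>();
        String previous = null;
        int run = 0;
        for (int i = 0; i < fieldValues.size(); i++) {
            String value = fieldValues.get(i);
            // Extend the current run if the value repeats, otherwise start a new run.
            run = (i > 0 && Objects.equals(value, previous)) ? run + 1 : 1;
            previous = value;
            if (run <= max) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sites = Arrays.asList("a.com", "a.com", "a.com", "b.com", "a.com");
        System.out.println(collapseAdjacent(sites, 2));
    }
}
```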




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760238#action_12760238
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

The required change is right at the end of the big diff:

{noformat}
Index: contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java
===================================================================
--- contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (revision 819270)
+++ contrib/clustering/src/test/java/org/apache/solr/handler/clustering/carrot2/CarrotClusteringEngineTest.java (working copy)
@@ -40,11 +40,11 @@
 @SuppressWarnings("unchecked")
 public class CarrotClusteringEngineTest extends AbstractClusteringTest {
   public void testCarrotLingo() throws Exception {
-    checkEngine(getClusteringEngine("default"), 9);
+    checkEngine(getClusteringEngine("default"), 10);
   }
 
   public void testCarrotStc() throws Exception {
-    checkEngine(getClusteringEngine("stc"), 2);
+    checkEngine(getClusteringEngine("stc"), 1);
   }
 
   public void testWithoutSubclusters() throws Exception {
{noformat}

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: SOLR-1314.patch


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Updated: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-27 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1314:


Attachment: SOLR-1314.patch

Hi Grant,

I've built Carrot2 3.1.0 binaries and tested them with Solr trunk. Attached is 
a patch that upgrades the libs to Carrot2 3.1.0 and fixes one unit test.

S.

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: SOLR-1314.patch


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-25 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759667#action_12759667
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

bq. Now that Lucene is final, can we finalize the jar for this one? 

Sure, over the weekend we'll be making an official Carrot2 3.1.0 release. As 
part of that process I'll check if the Solr plugin is working fine and will 
post the final JAR here.

bq. Also, this final JAR will handle the license and FastVector stuff, right?

Correct. The following commit removed it from trunk, and hence from the 3.1.0 release:

http://fisheye3.atlassian.com/changelog/carrot2/?cs=3694

S.

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-23 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758843#action_12758843
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

I've made Carrot2's dependency on Smart Chinese Analyzer optional, so no 
exceptions should be thrown when the big JAR is not in the classpath. As usual, 
download from here:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

S.

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756110#action_12756110
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

I've just dropped the patenting clause entirely. The updated license is in the 
repo and at: http://www.carrot2.org/carrot2.LICENSE.

S.

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-1336) Add support for lucene's SmartChineseAnalyzer

2009-09-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756177#action_12756177
 ] 

Stanislaw Osinski commented on SOLR-1336:
-

Keeping the Chinese analyzer JAR optional sounds good. As Carrot2 also uses it, 
I'd need to make sure the clustering contrib doesn't fail when the JAR is not 
there and clustering in Chinese is requested (I think I'd simply log a WARN 
saying that the Chinese analyzer JAR is required for best clustering results).
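
A minimal sketch of the kind of check meant here, assuming the optional analyzer is detected by probing for its class on the classpath. The analyzer class name is passed in by the caller because the exact name is an assumption and differs between Lucene versions; the sketch class and its methods are hypothetical and use {{java.util.logging}} only to stay self-contained.

```java
import java.util.logging.Logger;

final class OptionalAnalyzerCheckSketch {
    private static final Logger LOG =
        Logger.getLogger(OptionalAnalyzerCheckSketch.class.getName());

    /** Returns true if the named class can be loaded, without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false,
                OptionalAnalyzerCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;   // the optional JAR (or one of its dependencies) is missing
        }
    }

    /** Logs a warning instead of failing when the optional analyzer is absent. */
    static boolean checkChineseAnalyzer(String analyzerClassName) {
        boolean available = isClassAvailable(analyzerClassName);
        if (!available) {
            LOG.warning("Chinese analyzer JAR not found; it is required for best "
                + "clustering results in Chinese.");
        }
        return available;
    }

    public static void main(String[] args) {
        // Hypothetical class name, used only to demonstrate the fallback path.
        System.out.println(checkChineseAnalyzer("example.missing.ChineseAnalyzer"));
    }
}
```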

 Add support for lucene's SmartChineseAnalyzer
 -

 Key: SOLR-1336
 URL: https://issues.apache.org/jira/browse/SOLR-1336
 Project: Solr
  Issue Type: New Feature
  Components: Analysis
Reporter: Robert Muir
 Attachments: SOLR-1336.patch, SOLR-1336.patch, SOLR-1336.patch


 SmartChineseAnalyzer was contributed to lucene, it indexes simplified chinese 
 text as words.
 if the factories for the tokenizer and word token filter are added to solr it 
 can be used, although there should be a sample config or wiki entry showing 
 how to apply the built-in stopwords list.
 this is because it doesn't contain actual stopwords, but must be used to 
 prevent indexing punctuation... 
 note: we did some refactoring/cleanup on this analyzer recently, so it would 
 be much easier to do this after the next lucene update.
 it has also been moved out of -analyzers.jar due to size, and now builds in 
 its own smartcn jar file, so that would need to be added if this feature is 
 desired.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-15 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755657#action_12755657
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

As a follow-up of the discussion on legal-discuss, I've removed the dependency 
on {{FastVector}} from Carrot2's STC algorithm. The binaries are in the usual 
place:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-13 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754699#action_12754699
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Good point, Grant. Though the classes we included are merely definitions of 
native methods, it's better to keep them separate. I've just reverted to a 
separate {{nni.jar}}; the binaries are here:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-12 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754593#action_12754593
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Let me build Carrot2 with Lucene 2.9 RC4; I'll post a download URL in a while.

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-09-12 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754597#action_12754597
 ] 

Stanislaw Osinski commented on SOLR-1314:
-

Hi Grant,

Here's Carrot2 3.1-dev built with Lucene 2.9-rc4:

http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/

Please note a few things about the dependencies:

* {{nni.jar}} is now part of {{carrot2-mini.jar}}, so no need to download it 
separately
* dependencies have been upgraded to newer versions 
(http://download.carrot2.org/maven2/org/carrot2/carrot2-mini/3.1-dev/carrot2-mini-3.1-dev.pom);
 the Lucene entry in the POM still needs to be upgraded to version 2.9
* Carrot2 provides experimental support for Simplified Chinese based on the 
smart cn analyzer -- does Solr distribute that JAR by default?

Please let me know if you have any problems upgrading.

S.

 Upgrade Carrot2 to version 3.1.0
 

 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
Assignee: Grant Ingersoll
 Fix For: 1.4


 As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
 in clustering algorithms and improved clustering in Chinese. The upgrade 
 should be a matter of upgrading {{carrot2-mini.jar}} and 
 {{google-collections.jar}}.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736030#action_12736030
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

There's one more thing: we're planning to release version 3.1.0 of Carrot2 with 
certain bug fixes in the clustering algorithms and better support for Chinese (using 
the new analyzer from Lucene). Our plan is to release after Lucene 2.9 is out, 
but before Solr 1.4, so that the latter would have a newer version of Carrot2 
on board (should be just a matter of replacing Carrot2 JAR / upgrading version 
of the downloaded dependency). Would that make sense? Should I create a 
separate issue for it, or rather reopen this one?

Thanks,

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.




[jira] Created: (SOLR-1314) Upgrade Carrot2 to version 3.1.0

2009-07-28 Thread Stanislaw Osinski (JIRA)
Upgrade Carrot2 to version 3.1.0


 Key: SOLR-1314
 URL: https://issues.apache.org/jira/browse/SOLR-1314
 Project: Solr
  Issue Type: Task
Reporter: Stanislaw Osinski
 Fix For: 1.4


As soon as Lucene 2.9 is released, Carrot2 3.1.0 will come out with bug fixes 
in clustering algorithms and improved clustering in Chinese. The upgrade should 
be a matter of upgrading {{carrot2-mini.jar}} and {{google-collections.jar}}.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-07-28 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736039#action_12736039
 ] 

Stanislaw Osinski commented on SOLR-769:


Created: SOLR-1314. I'll attach a patch there as soon as Lucene 2.9 is released.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.




[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-07-08 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: subcluster-flattening.patch

Hi,

While configuring the clustering component for an algorithm that returns 
hierarchical clusters, it took me a while to debug why subclusters wouldn't 
appear in the output. It turned out that the default value for the 
{{carrot.outputSubClusters}} parameter is {{false}}, which was the opposite of 
what I assumed :-) Would it be a problem to change the default to {{true}}, so 
that other users avoid the same problem? 

Another improvement worth making for the {{carrot.outputSubClusters}} = 
{{false}} case is flattening the clusters: returning all documents of the 1st 
level clusters, including those contained in the subclusters the user chose not 
to output. Without this improvement, many document-cluster assignments may be 
lost because some Carrot2 algorithms will assign documents only to the leaf 
(deepest in the hierarchy) clusters.
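
The flattening idea can be sketched as follows. This is not the attached patch: the {{Cluster}} type below is a hypothetical stand-in for Carrot2's cluster class, and document ids are plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ClusterFlatteningSketch {
    /** Hypothetical minimal cluster: own documents plus nested subclusters. */
    static final class Cluster {
        final String label;
        final List<String> docIds = new ArrayList<>();
        final List<Cluster> subclusters = new ArrayList<>();

        Cluster(String label) {
            this.label = label;
        }
    }

    /** Collects the documents of a cluster and, recursively, of all its subclusters. */
    static Set<String> allDocIds(Cluster cluster) {
        Set<String> ids = new LinkedHashSet<>(cluster.docIds);   // set: no duplicates
        for (Cluster sub : cluster.subclusters) {
            ids.addAll(allDocIds(sub));
        }
        return ids;
    }

    public static void main(String[] args) {
        Cluster top = new Cluster("Search");
        Cluster leaf = new Cluster("Solr");
        leaf.docIds.add("doc1");
        leaf.docIds.add("doc2");
        top.subclusters.add(leaf);   // documents are assigned only to the leaf
        // With subclusters hidden, the top-level cluster still reports both documents.
        System.out.println(top.label + ": " + allDocIds(top));
    }
}
```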

I'm attaching a patch that implements both changes.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip, subcluster-flattening.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-06-30 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725739#action_12725739
 ] 

Stanislaw Osinski commented on SOLR-769:


bq. Is labels is needed because there could be multiple labels per cluster in 
the future? ( I assume yes)

Correct. Currently neither of Carrot2's algorithms creates clusters with 
multiple labels, but it's quite likely that there are other algorithms that can 
do that.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712534#action_12712534
 ] 

Stanislaw Osinski commented on SOLR-769:


In fact, you can set Carrot2 attributes (both init- and request-time) in the 
Solr config file; this should also work without the patch. Just add:

{{<str name="Tokenizer.analyzer">fully.qualified.class.Name</str>}}

to the search component element. See 
http://wiki.apache.org/solr/ClusteringComponent for an example. You'll find a 
list of Carrot2 attributes, their ids and descriptions at: 
http://download.carrot2.org/stable/manual/#chapter.components.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-24 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712545#action_12712545
 ] 

Stanislaw Osinski commented on SOLR-769:


Ah, I should have mentioned that up front -- Carrot2 will try to convert the 
string into the type accepted by the attribute. In the case of class-typed 
attributes, it will try to load the class using the current thread's context 
classloader. Conversions are also available for numeric, boolean and enum 
attributes (see: 
http://download.carrot2.org/head/javadoc/org/carrot2/util/attribute/AttributeBinder.AttributeTransformerFromString.html).
Please let me know if that way works for you.
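
To make the conversion described above concrete, here is a rough Java sketch of a string-to-type conversion. It is not Carrot2's {{AttributeBinder}} code: the class {{StringAttributeConverter}} and its {{convert}} method are hypothetical, and the set of supported types is an assumption based only on the list in this comment (class, numeric, boolean, enum).

```java
// Hypothetical helper, not part of Carrot2: converts a configuration string
// into the type an attribute expects.
class StringAttributeConverter {
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Object convert(String value, Class<?> targetType) {
        if (targetType == String.class) {
            return value;
        }
        if (targetType == Integer.class || targetType == int.class) {
            return Integer.valueOf(value);
        }
        if (targetType == Double.class || targetType == double.class) {
            return Double.valueOf(value);
        }
        if (targetType == Boolean.class || targetType == boolean.class) {
            return Boolean.valueOf(value);
        }
        if (targetType.isEnum()) {
            // Enum attributes: look the constant up by its name.
            return Enum.valueOf((Class) targetType, value);
        }
        if (targetType == Class.class) {
            // Class-typed attributes: resolve the name through the current
            // thread's context classloader.
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            try {
                return Class.forName(value, true, loader);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Cannot load class " + value, e);
            }
        }
        throw new IllegalArgumentException("No conversion to " + targetType.getName());
    }
}
```

Because the lookup goes through the context classloader, a class named in the Solr config has to be visible to that loader at the time the attribute is bound.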

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-23 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712421#action_12712421
 ] 

Stanislaw Osinski commented on SOLR-769:


Pasting the comment I made on the list:

The catch with the analyzer is that this specific attribute is an 
initialization-time attribute, so you need to add it to the {{initAttributes}} 
map in the {{init()}} method of {{CarrotClusteringEngine}}.

Please let me know if this solves the problem. If not, I'll investigate further.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-componet-shard.patch, clustering-libs.tar, 
 clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.tar, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-05-16 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710087#action_12710087
 ] 

Stanislaw Osinski commented on SOLR-769:


Thanks Grant! Looking forward to seeing the code in the repo!

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, 
 SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-04-03 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695463#action_12695463
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi Grant,

If you download http://download.carrot2.org/stable/carrot2-java-api-3.0.1.zip, 
you'll find licenses in the lib/ folder of the distribution. That distribution 
contains slightly more JARs than needed for Solr (which uses carrot2-mini.jar), 
so you'd need to pick only those that are relevant.

S.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip





[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-22 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688171#action_12688171
 ] 

Stanislaw Osinski commented on SOLR-769:


bq. Also, you say C2 can handle full docs, is it feasible, then to implement it 
for the offline mode I have in mind, whereby you cluster the whole collection 
offline and then store the clusters for retrieval? I haven't implemented this 
yet, but was thinking some people will be interested in full corpus clustering. 
The nice thing, then, is that as new documents come in, they can be added to 
existing clusters (and maybe periodically, we re-cluster). Just thinking 
outloud.

We have two variables here: the length of docs and the number of docs. Carrot2 
is suitable for small numbers of docs (up to, say, 1000). If the docs are short 
(a paragraph or so), the clustering should be pretty fast, suitable for on-line 
processing (see: http://project.carrot2.org/algorithms.html). If the documents 
get longer, Carrot2 will still handle them, but will require some more time for 
processing; I'll try to do some measurements. But C2 is not useful for the 
whole collection case -- it performs all processing in-memory, and here we'd 
need a totally different class of algorithm, something along the lines of 
Mahout developments.

bq. Hmm, that's an interesting thought. We could check to see if highlighting 
is done first.

To quickly summarise the pros and cons of relying on highlighting being done 
outside of the clustering component:

Pros:

* we avoid duplication of processing (highlighting being done twice)
* simpler code of the clustering component, less configuration

Cons:

* if someone doesn't want highlighting in the search results, the clustering is 
likely to take more time (because it operates on full documents, and it's 
controlled globally)
* depending on the highlighter, we may get some markup in the summaries, which 
may affect clustering (I'd need to check how Carrot2 handles that)

bq. Should the MockClusteringAlgorithm be under the test source tree and not 
the main one? I moved it in the patch to follow 

Absolutely, it should be in the test source.

bq. I don't think we need to output the number of clusters, since that will be 
obvious from the list size. I dropped it in the patch to follow

Makes sense, I kept it because the original version had it.

bq. Also, on the response structure, we certainly could make it optional, 
although it means having to go do a lookup in the real doc list, which could be 
less than fun.

By lookup you mean the lookup in the XML response? Here again we have a 
trade-off between the length of the response and ease of processing: if we 
repeat document titles / snippets in the clusters structure, we at least double 
the response size (at least because the same document may belong to many 
clusters), but can potentially save some lookups. But if we want to get some 
other fields of a document (other than those we repeat in the clusters list), 
we'd still need a lookup. 

To sum up, my intuition would be to avoid duplication and stick with document 
ids in the cluster list (this is what we do in Carrot2 XMLs as well). 
Optionally, the clustering component could have a list of configurable fields 
to be repeated in the cluster list if that's really helpful in real-world use 
cases.
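
The two options can be sketched in Java as follows. This is a hypothetical illustration, not the component's actual response code: {{ClusterResponseSketch}}, {{clusterEntry}}, and the {{labels}} / {{docs}} keys are assumed names. Each cluster entry always carries document ids, and copies extra fields from the main document list only when they are explicitly requested.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the clustering component's real response builder.
class ClusterResponseSketch {
    // Builds one cluster entry: always the document ids, plus copies of the
    // requested fields (if any) taken from the main document list.
    static Map<String, Object> clusterEntry(String label,
                                            List<String> docIds,
                                            Map<String, Map<String, String>> docsById,
                                            List<String> fieldsToRepeat) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("labels", Collections.singletonList(label));
        List<Map<String, String>> docs = new ArrayList<>();
        for (String id : docIds) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", id);
            Map<String, String> full = docsById.get(id);
            for (String field : fieldsToRepeat) {
                // Repeat a field only if the main document actually has it.
                if (full != null && full.containsKey(field)) {
                    doc.put(field, full.get(field));
                }
            }
            docs.add(doc);
        }
        entry.put("docs", docs);
        return entry;
    }
}
```

With an empty {{fieldsToRepeat}} list the entry contains ids only, and a client resolves anything else through the main document list; each field added to the list grows the response once per cluster membership of a document.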

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.zip



[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: (was: SOLR-769.patch)

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.zip





[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-20 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769.zip

Further code clean-ups, support for passing initialization-time attributes to 
Carrot2 algorithms, some comments in the example configuration file.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.4

 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.zip





[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-18 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: (was: SOLR-769-lib.zip)

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch





[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-18 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769-lib.zip

Libs with Carrot2 v3.0.1 we've just released.

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, clustering-libs.tar, 
 SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, 
 SOLR-769.patch, SOLR-769.patch





[jira] Updated: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-769:
---

Attachment: SOLR-769-lib.zip
SOLR-769.patch

Yet another patch, this time with passing unit tests and a working example. I 
will make some more comments in a sec. Please use the SOLR-769-lib.zip libs 
with this patch.




[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2009-03-11 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680942#action_12680942
 ] 

Stanislaw Osinski commented on SOLR-769:


Hi All,

I've just uploaded a patch that passes unit tests and has a working example, 
but this is by no means a final version. A few outstanding questions / issues:

# h4. Response structure.

I was wondering -- do we need to repeat the document contents in the 'clusters' 
response section? Assuming that each document in the index has a unique ID, we 
could reduce the size of the response by just referencing documents by IDs like 
this:
\\
{code}
<lst name="clusters">
 <int name="numClusters">3</int>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">GPU VPU Clocked</str>
  </lst>
  <lst name="docs">
   <str name="doc">EN7800GTX/2DHTV/256M</str>
   <str name="doc">100-435805</str>
  </lst>
 </lst>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">Hard Drive</str>
  </lst>
  <lst name="docs">
   <str name="doc">6H500F0</str>
   <str name="doc">SP2514N</str>
  </lst>
 </lst>
 <lst name="cluster">
  <lst name="labels">
   <str name="label">Other Topics</str>
  </lst>
  <lst name="docs">
   <str name="doc">9885A004</str>
  </lst>
 </lst>
</lst>
{code}
Actually, this is what I've implemented in the patch.

Also, in the case of hierarchical clusters, I've introduced a grouping entity 
called clusters so that the top- and sub-levels of the response are consistent 
(see unit tests). Please let me know if this makes sense.
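
To illustrate what the ID-only structure above means for a client, here is a minimal sketch of resolving the IDs listed in a cluster back to the documents from the main result list. This is not code from the patch: the class name, method names, and the "id" field name are all hypothetical, and the documents are modelled as plain maps rather than Solr types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper; none of these names come from the patch.
public class ClusterDocResolver {

    /** Index the documents of the main result list by their unique ID field. */
    public static Map<String, Map<String, Object>> indexById(
            List<Map<String, Object>> docs, String idField) {
        Map<String, Map<String, Object>> byId = new LinkedHashMap<>();
        for (Map<String, Object> doc : docs) {
            byId.put(String.valueOf(doc.get(idField)), doc);
        }
        return byId;
    }

    /**
     * Map the IDs listed in one cluster to full documents, keeping cluster order.
     * IDs that are not present in the result list are skipped.
     */
    public static List<Map<String, Object>> resolve(
            List<String> clusterDocIds, Map<String, Map<String, Object>> byId) {
        List<Map<String, Object>> resolved = new ArrayList<>();
        for (String id : clusterDocIds) {
            Map<String, Object> doc = byId.get(id);
            if (doc != null) {
                resolved.add(doc);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Two stand-in result documents; only the "id" values mirror the example above.
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("id", "6H500F0");
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("id", "SP2514N");

        Map<String, Map<String, Object>> byId =
                indexById(Arrays.asList(first, second), "id");

        // The "Hard Drive" cluster from the example references both documents by ID.
        List<Map<String, Object>> hardDrive =
                resolve(Arrays.asList("6H500F0", "SP2514N"), byId);
        System.out.println("Resolved documents: " + hardDrive.size());
    }
}
```

The trade-off is that the client needs one extra lookup per document, in exchange for a response that does not repeat document contents in every cluster.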

# h4. Build: compile warnings about missing SimpleXML

SimpleXML is one of the problematic dependencies as it's GPL. Luckily, it's not 
needed at runtime, but its absence generates warnings about missing 
dependencies at compile time. So the options are either to live with the 
warnings or to add SimpleXML (version 1.7.2) to get rid of them.

# h4. Build: copying of protowords.txt etc

The patch includes lexical files both in the 
contrib/clustering/src/java/test/resources/ and in the examples dir. I'm 
not sure how this is handled though -- do you keep copies in the repository or 
copy those somehow in the build?

# h4. Highlighting

This is the bit I've not yet fully analyzed. In general, Carrot2 should handle 
full documents fairly well (up to say a few hundred kB each), it's just the 
number of documents that must be on the order of hundreds. Therefore, 
highlighting is not mandatory, but it may sometimes improve the quality of 
clusters.

I was wondering: if highlighting is performed earlier in the Solr pipeline, 
could this be reused during clustering? One possible approach could be that 
clustering uses whatever is fed from the pipeline. If highlighting is enabled, 
clustering will be performed on the highlighted content; if there was no 
highlighting, we'd cluster full documents. Not sure if that's reasonable / 
possible to implement though.
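
A minimal sketch of the fallback described above, under the assumption that highlighting output is available as a map of document ID to field name to snippets (loosely modelled on the shape of Solr's highlighting response section). The class, method, and parameter names are hypothetical and this is not how the patch is implemented.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating "cluster whatever the pipeline provides".
public class ClusteringInputSelector {

    /**
     * Pick the text to cluster for one document: the highlighted snippets for
     * the given field if any were produced, otherwise the full field content.
     *
     * @param highlighting docId -> field -> snippets; may be null when
     *                     highlighting was not enabled (assumed shape)
     * @param fullContent  docId -> full stored content of the field
     */
    public static String contentFor(String docId,
                                    String field,
                                    Map<String, Map<String, List<String>>> highlighting,
                                    Map<String, String> fullContent) {
        if (highlighting != null) {
            Map<String, List<String>> perField = highlighting.get(docId);
            if (perField != null) {
                List<String> snippets = perField.get(field);
                if (snippets != null && !snippets.isEmpty()) {
                    // Join the snippets into a single text for the clustering input.
                    return String.join(" ... ", snippets);
                }
            }
        }
        // No highlighting for this document: fall back to the full content.
        String full = fullContent.get(docId);
        return full == null ? "" : full;
    }
}
```

One open point this sketch leaves aside is any highlighter markup inside the snippets, which would probably need to be stripped before clustering.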

# h4. Documentation (wiki) updates

Once we stabilise the ideas, I'm happy to update the wiki with regard to the 
algorithms used (Lingo/STC) and passing additional parameters.

