Now, a lost data problem with trunk too
Hi folks,

It looks like the handle leak may be real - Simon Willnauer has been looking at it and could not find an explanation for the behavior I have been seeing. But before we got too far on that problem, I encountered what appears to be an even more serious problem. Specifically, I'm losing field data out of some records.

The index I'm building is fairly large - some 25M records when complete. What I'm seeing is that the main searchable field (value) is not finding all the records it should. I was able to locate one such record just now:

curl http://localhost:8983/solr/nose/standard?fl=*,score&q=id:\"POI|DEU:205:20187477:1014564|brandenburger+tor\"

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">95</int><lst name="params"><str name="q">id:"POI|DEU:205:20187477:1014564|brandenburger tor"</str><str name="fl">*,score</str></lst></lst>
<result name="response" numFound="1" start="0" maxScore="17.335964"><doc><float name="score">17.335964</float><str name="entityid">POI|DEU:205:20187477:1014564|brandenburger tor</str><str name="id">POI|DEU:205:20187477:1014564|brandenburger tor</str><str name="reference">brandenburger tor, potsdam, deutschland</str><str name="type">poi</str> ... </doc></result>
</response>

... but it is completely missing the supposedly required value field:

<!-- The value field. This contains the actual string that will be matched. -->
<field name="value" type="string_idx" required="true" stored="false"/>

The code that does the indexing is straightforward, and *some* of the records of this class are indeed searchable via the value field, but others aren't. I know the value field is non-empty, because it is used to construct the id field, which is correct above.

Simon is also looking into this one, but if anyone else has advice for figuring out what's going wrong, please let me know. FWIW, this is a trunk build from Monday morning.

Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Now, a lost data problem with trunk too
On Tue, Sep 14, 2010 at 10:37 AM, karl.wri...@nokia.com wrote:

> ... but it is completely missing the supposedly required value field:
>
> <!-- The value field. This contains the actual string that will be matched. -->
> <field name="value" type="string_idx" required="true" stored="false"/>

that does not show up since it is not stored - maybe that's the reason :)

simon
Re: Fwd: Trunk file handle leak?
An update on this: the error was on my side, doubly incrementing the searcher reference. No problem on trunk!

simon

On Fri, Sep 10, 2010 at 10:04 PM, Simon Rosenthal simon.rosent...@yahoo.com wrote:

> Karl: I reported something very similar a few months back and opened a Jira issue - see
> https://issues.apache.org/jira/browse/SOLR-1911. After I changed to a newer nightly build the
> leak went away. The issue is still open, so you may want to update it.
>
> -Simon
>
> ---------- Forwarded message ----------
> From: karl.wri...@nokia.com
> Date: Fri, Sep 10, 2010 at 3:24 PM
> Subject: RE: Trunk file handle leak?
> To: dev@lucene.apache.org, yo...@lucidimagination.com
>
> Hi Yonik,
> Be that as it may, I'm seeing a steady increase in file handles used by that process over an
> extended period of time (now 20+ minutes):
>
> r...@duck6:~# lsof -p 22379 | wc
>     786    7714  108339
> r...@duck6:~# lsof -p 22379 | wc
>     787    7723  108469
> r...@duck6:~# lsof -p 22379 | wc
>     787    7723  108469
> r...@duck6:~# lsof -p 22379 | wc
>     812    7948  111719
> r...@duck6:~# lsof -p 22379 | wc
>     816    7984  112239
> r...@duck6:~# lsof -p 22379 | wc
>     817    7993  112369
> r...@duck6:~# lsof -p 22379 | wc
>     822    8038  113019
> r...@duck6:~# lsof -p 22379 | wc
>     847    8308  116719
> r...@duck6:~# lsof -p 22379 | wc
>     852    8353  117369
> r...@duck6:~# lsof -p 22379 | wc
>     897    8803  123669
> r...@duck6:~# lsof -p 22379 | wc
>    1022   10018  140819
> r...@duck6:~#
>
> This doesn't smell like spiky resource usage to me. It smells like a leak. ;-)
>
> Karl
>
> -----Original Message-----
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of ext Yonik Seeley
> Sent: Friday, September 10, 2010 2:08 PM
> To: dev@lucene.apache.org
> Subject: Re: Trunk file handle leak?
>
> On Fri, Sep 10, 2010 at 1:51 PM, karl.wri...@nokia.com wrote:
>> (1) There are periodic commits, every 10,000 records.
>> (2) I have no searcher/reader open at the same time, that I am aware of. This is a straight
>> indexing task. (You ought to know, you wrote some of the code!)
>
> A commit currently opens a new searcher in Solr. It's not too hard to go past 1024 descriptors -
> either raise the limit to 10240 or something, use the compound file format, or lower the merge
> factor.
>
>> (3) I *do* see auto warming being called, but it seems not to happen at the same time as a
>> commit, but rather afterwards. Once it starts happening, this happens repeatedly on every commit.
>
> This would also be expected - it's at a point where there are too many files for your descriptors.
>
> -Yonik
> http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
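For readers hitting the same descriptor ceiling: the two index-side workarounds Yonik names correspond to solrconfig.xml settings. A hedged sketch of what that could look like in a 1.4-era/trunk config - the values are illustrative, not taken from Karl's setup:

```xml
<!-- Illustrative only: both settings reduce the number of files (and thus
     open file descriptors) an index holds at once. -->
<indexDefaults>
  <!-- Pack each segment's files into one compound (.cfs) file. -->
  <useCompoundFile>true</useCompoundFile>
  <!-- A lower merge factor allows fewer segments before merging,
       so fewer files are open simultaneously. -->
  <mergeFactor>5</mergeFactor>
</indexDefaults>
```

The third option, raising the process limit itself (e.g. `ulimit -n 10240` in the shell that launches the JVM), needs no Solr configuration at all.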
[jira] Commented: (SOLR-2106) Spelling Checking for Multiple Fields
[ https://issues.apache.org/jira/browse/SOLR-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909157#action_12909157 ]

JAYABAALAN V commented on SOLR-2106:
------------------------------------

What is the procedure to download the SOLR-2106.patch files into the existing Apache Solr v1.4?

> Spelling Checking for Multiple Fields
> -------------------------------------
>
>                 Key: SOLR-2106
>                 URL: https://issues.apache.org/jira/browse/SOLR-2106
>             Project: Solr
>          Issue Type: Bug
>          Components: spellchecker
>    Affects Versions: 1.4
>         Environment: Linux Environment
>            Reporter: JAYABAALAN V
>             Fix For: 1.4
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> Need to enable spellchecking for five different fields and their configuration. I am using the
> dismax query parser for searching the different fields. If a user has entered a wrong spelling in
> the front end, it should check the five different fields, give a collated spelling suggestion,
> and return results based on that suggestion. Do provide your configuration details for the same.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1900) move Solr to flex APIs
[ https://issues.apache.org/jira/browse/SOLR-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909160#action_12909160 ]

Michael McCandless commented on SOLR-1900:
------------------------------------------

I think it makes sense to move append to BytesRef, though I wonder if it should over-allocate (ArrayUtil.oversize) when it grows? I realize for the current calls to append we don't need that (you just append bigTerm, once), but if someone uses this like a StringBuffer... though, this isn't really the intention of BytesRef, so maybe it's OK not to oversize.

> move Solr to flex APIs
> ----------------------
>
>                 Key: SOLR-1900
>                 URL: https://issues.apache.org/jira/browse/SOLR-1900
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Yonik Seeley
>             Fix For: 4.0
>         Attachments: SOLR-1900-facet_enum.patch, SOLR-1900-facet_enum.patch, SOLR-1900_bigTerm.txt, SOLR-1900_FileFloatSource.patch, SOLR-1900_termsComponent.txt
>
> Solr should use flex APIs
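For context on the over-allocation question: growing exactly to the needed size makes n single-byte appends copy O(n^2) bytes, while adding headroom on each growth amortizes the copying. A minimal sketch of that idea - this is not Lucene's BytesRef, and the growth formula is illustrative rather than ArrayUtil.oversize's actual policy:

```java
import java.util.Arrays;

// Sketch of an append() that over-allocates on growth, in the spirit of
// what ArrayUtil.oversize enables. Names and growth policy are illustrative.
class GrowableBytes {
    byte[] bytes = new byte[0];
    int length;

    // Grow to at least minSize, with ~12.5% headroom so repeated appends
    // cost amortized O(1) copying per byte.
    static int oversize(int minSize) {
        return minSize + (minSize >>> 3) + 3;
    }

    void append(byte[] other, int offset, int len) {
        int newLength = length + len;
        if (newLength > bytes.length) {
            bytes = Arrays.copyOf(bytes, oversize(newLength));
        }
        System.arraycopy(other, offset, bytes, length, len);
        length = newLength;
    }
}
```

For the current single bigTerm append, the headroom is wasted space, which is exactly the trade-off the comment weighs; for StringBuffer-style usage it is what keeps appends cheap.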
RE: Now, a lost data problem with trunk too
Yes. Of course. My oversight.

So I did the obvious thing and searched for the value field directly, and it is there:

<str name="id">POI|DEU:205:20187477:1014564|brandenburger tor</str><str name="language">ger</str><str name="latitude">52.39935</str><str name="longitude">13.04793</str><str name="reference">brandenburger tor, potsdam, deutschland</str>

So, something about the way I am searching for it is not right. Looking elsewhere.

Karl

________________________________
From: ext Simon Willnauer [simon.willna...@googlemail.com]
Sent: Tuesday, September 14, 2010 4:52 AM
To: dev@lucene.apache.org
Subject: Re: Now, a lost data problem with trunk too

On Tue, Sep 14, 2010 at 10:37 AM, karl.wri...@nokia.com wrote:

> ... but it is completely missing the supposedly required value field:
>
> <!-- The value field. This contains the actual string that will be matched. -->
> <field name="value" type="string_idx" required="true" stored="false"/>

that does not show up since it is not stored - maybe that's the reason :)

simon
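The resolution turns on Solr's indexed/stored distinction: a field can match queries yet never appear in results. A hedged schema.xml sketch of the contrast - only the value field line appears in the thread; the explicit indexed attribute and the second field are assumptions for illustration:

```xml
<!-- Sketch only. An indexed-but-unstored field is searchable, yet fl=*
     omits it because the original string is never kept in the index;
     required="true" only forces the field to be present at index time. -->
<field name="value" type="string_idx" indexed="true" stored="false" required="true"/>

<!-- A stored field, by contrast, does come back in search results
     (type and name here are illustrative). -->
<field name="reference" type="string" indexed="true" stored="true"/>
```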
Re: [jira] Commented: (SOLR-2106) Spelling Checking for Multiple Fields
See:

http://wiki.apache.org/solr/HowToContribute#Working_With_Patches

Erick

On Tue, Sep 14, 2010 at 5:11 AM, JAYABAALAN V (JIRA) j...@apache.org wrote:

> what is procedure to download the SOLR-2010.patch files into the exisiting Apache Solr v1.4
[jira] Commented: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909214#action_12909214 ]

Robert Muir commented on LUCENE-2504:
-------------------------------------

{quote}
Java (Oracle) really needs to do something to address this.
{quote}

I think we all owe it to ourselves to stop equating Java with Sun/Oracle; if Java stays with Oracle it's pretty obvious the language will die anyway.

{quote}
I think this is a severe and growing problem for Lucene going forward - our search performance is crucial and we can't risk hotspot randomly, substantially slowing things down by a lot.
{quote}

While I agree at the moment we should make efforts to work around issues like this, I don't think we should jump the gun and make real design/architectural choices based on Oracle bugs. Especially for trunk: by the time we release Lucene 4.0, some other company will probably own Java anyway.

{quote}
Not that we have a choice here... but I've often wondered whether .NET has this same hotspot fickleness problem
{quote}

.NET is not a choice but generating C/C++ code is?

> sorting performance regression
> ------------------------------
>
>                 Key: LUCENE-2504
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2504
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 4.0
>            Reporter: Yonik Seeley
>             Fix For: 4.0
>         Attachments: LUCENE-2504.patch, LUCENE-2504.patch, LUCENE-2504.zip
>
> sorting can be much slower on trunk than branch_3x
[jira] Commented: (SOLR-1682) Implement CollapseComponent
[ https://issues.apache.org/jira/browse/SOLR-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909223#action_12909223 ]

Varun Gupta commented on SOLR-1682:
-----------------------------------

Is there any workaround to use the Highlight and Facet components along with grouping?

> Implement CollapseComponent
> ---------------------------
>
>                 Key: SOLR-1682
>                 URL: https://issues.apache.org/jira/browse/SOLR-1682
>             Project: Solr
>          Issue Type: Sub-task
>          Components: search
>            Reporter: Martijn van Groningen
>            Assignee: Shalin Shekhar Mangar
>             Fix For: Next
>         Attachments: field-collapsing.patch, SOLR-1682.patch, SOLR-1682.patch, SOLR-1682.patch, SOLR-1682.patch, SOLR-1682.patch, SOLR-1682.patch, SOLR-1682.patch, SOLR-1682_prototype.patch, SOLR-1682_prototype.patch, SOLR-1682_prototype.patch, SOLR-236.patch
>
> Child issue of SOLR-236. This issue is dedicated to field collapsing in general and all its code (CollapseComponent, DocumentCollapsers and CollapseCollectors). The main goal is to finalize the request parameters and response format.
[jira] Commented: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909230#action_12909230 ]

Simon Willnauer commented on LUCENE-2504:
-----------------------------------------

bq. I think we all owe it to ourselves to stop equating java with Sun/Oracle, if Java stays with Oracle its pretty obvious the language (is) will die anyway.

I agree with Robert that we should stop comparing against Sun JVMs all the time, turning everything upside-down specializing code here and there, or going one step further and generating C++ code. Dude, who is gonna maintain the compatibility to Java-only environments? I could imagine that we have something super special purpose, like Mike did with DirectNIOFSDirectory to work around unexposed methods like fadvise. I think code specializations of very hot parts of Lucene are OK and we should follow that way like we did in some places, but it already makes things very complicated to follow. Without the knowledge of a committer or a person actively following that development, it is extremely difficult to comprehend design decisions.

I would rather put effort into stuff like Harmony and make code we can control perform better than introduce a preprocessor which generates code for a JVM owned by a company. Would it make way more sense to push OSS JVMs than to spend lots of time investigating .NET as an alternative, or a C/C++ code generator? Before I would go the C++ path, I'd rather use Java to host a C core like Lucy, which brings you as close as it gets to the machine.

bq. EG, see my post here: interesting papers

Seems we are touching the limits of Java though.

> sorting performance regression
> ------------------------------
>
> sorting can be much slower on trunk than branch_3x
Re: Whither ORP?
On Sep 13, 2010, at 12:33 PM, Itamar Syn-Hershko wrote:

> With the proper two-way open-source development process (taking and then giving) I think it can
> become an important part of open-IR technologies, just like what Lucene did to the search engines
> world. What ORP has to offer is of great interest to HebMorph, an open-source project of mine
> trying to decide on what is the best way to index and search Hebrew texts. To this end I decided
> to put some of the development efforts of the HebMorph project into making tools for the ORP. I
> have announced this before, but unfortunately I had to attend to more pressing tasks before I
> could complete this (and there was no response from the community anyway...). Just in case you're
> interested in seeing what I came up with so far: http://github.com/synhershko/Orev.

If you can, putting them up as a patch would be useful. That way, we can show some progress.

> IMHO, the ORP should stand by itself, and relate to Lucene/Solr only as its basis framework for
> these initial stages. Perhaps also try to attract more people who could find an interest in what
> it has to offer, so it can really start growing.
>
> Itamar.
>
> On 12/9/2010 1:29 PM, Grant Ingersoll wrote:
>> On Sep 11, 2010, at 8:51 PM, Robert Muir wrote:
>>> i propose we take what we have and import into lucene-java's benchmark contrib. it already has
>>> integration with wikipedia and reuters for perf purposes, and the quality package is actually
>>> there anyways. later, maybe more people have time and contrib/benchmark evolves naturally...
>>> e.g. to modules/benchmark with solr support as a first big step.
>>
>> Yeah, that seems reasonable. I have been thinking lately that it might be useful to pull our
>> DocMaker stuff out separately from benchmark so that people have easy ways of generating content
>> from things like Wikipedia, etc. Still, at the end of the day, I like what ORP _could_ bring to
>> the table and to some extent I think that is lost by folding it into Lucene benchmark.
>>
>> On Sep 11, 2010 7:33 PM, Grant Ingersoll gsing...@apache.org wrote:
>>> Seems ORP isn't really catching on with people. I know personally I don't have the time I had
>>> hoped to have to get it going. At the same time, I really think it could be a good project.
>>> We've got some tools put together, but we still haven't done much about the bigger goal of a
>>> self contained evaluation. Any thoughts on how we should proceed with ORP?
>>> -Grant

--
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
Re: Whither ORP?
I think the biggest hurdle we have in front of us is curating a data set that we can redistribute. I'm in the process of uploading all the ASF public mail archives as of Sept. 13 to Amazon S3. I also have some tools (thanks to Chris Rhodes) for processing this into Solr XML. I think this would give us a standard corpus to start with and would mimic some enterprise search/eDiscovery tasks pretty well.

At any rate, as with any community, the proof is in people stepping up to help out. I like that so many people suggested we keep going. As for what to do, I think the options are pretty wide open and there is opportunity for people to define the project w/o any previous encumbrances. Some ideas that have been kicked around in the past:

1. Creative-commons data set, judgments, queries
2. Open Street Map (spatial search)
3. Mail archives
4. A crowd sourcing application. Given a set of documents and queries, have people provide
   judgments. Ideally, this runs in a web container and we could probably even find resources to
   host it here. Combining that with one of the items above, we would be on our way. The app could
   also solicit queries by giving users an open search box and opportunities to browse the data.

I know much of this is simplistic, but it is a start.

-Grant

On Sep 13, 2010, at 9:04 PM, Dan Cardin wrote:

> Hello, I am new to ORP. I would like to contribute to the project. I do not have a lot of
> experience in this field of IR, crowd sourcing or AI. If someone could take the lead and set a
> forward path, I would be willing to contribute my skill set to ORP. How can I help? I have a lot
> of experience doing software development and system administration.
>
> Cheers,
> --Dan
>
> On Mon, Sep 13, 2010 at 1:36 PM, Omar Alonso oralo...@yahoo.com wrote:
>> I think ORP is a great candidate for crowdsourcing/human computation. In the last year or so
>> there's been quite a bit of research and applications on this. See the page for the SIGIR
>> workshop on using crowdsourcing for IR evaluation: http://www.ischool.utexas.edu/~cse2010/
>>
>> Omar
>>
>> --- On Mon, 9/13/10, Itamar Syn-Hershko ita...@code972.com wrote:
>>> From: Itamar Syn-Hershko ita...@code972.com
>>> Subject: Re: Whither ORP?
>>> To: openrelevance-...@lucene.apache.org
>>> Date: Monday, September 13, 2010, 9:33 AM
[jira] Commented: (LUCENE-2643) StringHelper#stringDifference is wrong about supplementary chars
[ https://issues.apache.org/jira/browse/LUCENE-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909253#action_12909253 ]

Robert Muir commented on LUCENE-2643:
-------------------------------------

My vote would be to drop it if we aren't using it; it's @lucene.internal. Since it's unused, it's not obvious that it's wrong (it's correct if you want the first code unit difference).

> StringHelper#stringDifference is wrong about supplementary chars
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2643
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2643
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.0, 3.0.1, 3.0.2
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Trivial
>             Fix For: 3.0.3, 3.1, 4.0
>         Attachments: LUCENE-2643.patch
>
> StringHelper#stringDifference does not take supplementary characters into account. Since this is not used internally at all, we should think about removing it, but since it is not too complex we could just fix it for bwcompat reasons. For released versions we should really fix it, since folks might use it. For trunk we could just drop it.
[jira] Updated: (LUCENE-2643) StringHelper#stringDifference is wrong about supplementary chars
[ https://issues.apache.org/jira/browse/LUCENE-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2643:
------------------------------------

    Attachment: LUCENE-2643.patch

here is a patch

> StringHelper#stringDifference is wrong about supplementary chars
[jira] Commented: (LUCENE-2643) StringHelper#stringDifference is wrong about supplementary chars
[ https://issues.apache.org/jira/browse/LUCENE-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909256#action_12909256 ]

Simon Willnauer commented on LUCENE-2643:
-----------------------------------------

bq. since its unused, its not obvious that its wrong (its correct if you want the first code unit difference)

yeah - my interpretation would be it's wrong since you use String.charAt(int) with the index of the first code unit. Anyway - we should drop it for trunk, but I am not sure if we should for 3.x. I mean, this is not that much of a deal anyway.

> StringHelper#stringDifference is wrong about supplementary chars
[jira] Commented: (LUCENE-2643) StringHelper#stringDifference is wrong about supplementary chars
[ https://issues.apache.org/jira/browse/LUCENE-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909260#action_12909260 ]

Robert Muir commented on LUCENE-2643:
-------------------------------------

drop in trunk and mark deprecated in 3.x? regardless of whether it's right or wrong, if we aren't using it, i think it's good to clean house.

> StringHelper#stringDifference is wrong about supplementary chars
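To make the bug concrete for readers: a charAt-based difference walks UTF-16 code units, so for two strings that differ only in the low surrogate of a supplementary character it returns an index that falls inside a surrogate pair - correct as a "first code unit difference", wrong as a code-point boundary. A small illustrative sketch, not the actual StringHelper code:

```java
// Illustrative only - not Lucene's StringHelper. Returns the index (in
// UTF-16 code units) of the first position where the two strings differ.
class StringDiffSketch {
    static int codeUnitDifference(String a, String b) {
        int len = Math.min(a.length(), b.length());
        for (int i = 0; i < len; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        // U+1D50A and U+1D50B encode as surrogate pairs sharing the same
        // high surrogate (\uD835) and differing only in the low surrogate.
        String a = "\uD835\uDD0A";
        String b = "\uD835\uDD0B";
        int i = codeUnitDifference(a, b);
        // i == 1 points between the surrogates of a pair, i.e. not at a
        // valid code-point boundary - the supplementary-character pitfall.
        System.out.println(i + " " + Character.isLowSurrogate(a.charAt(i)));
    }
}
```

A code-point-aware variant would have to back up to the start of the pair (Character.isHighSurrogate on the preceding unit) before reporting the difference.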
Re: Whither ORP?
Hello, This is a great start! I am interested in helping with the development of a crowd sourcing application. The next step would be creating a set of requirements for the web app. Would the ORP wiki be a good place to store the requirements? --Dan On Tue, Sep 14, 2010 at 9:51 AM, Grant Ingersoll gsing...@apache.org wrote: I think the biggest hurdle we have in front of us is curating a data set that we can redistribute. I'm in the process of uploading all the ASF public mail archives as of Sept. 13 to Amazon S3. I also have some tools (thanks to Chris Rhodes) for processing this into Solr XML. I think this would give us a standard corpus to start with and would mimic some enterprise search/eDiscovery tasks fairly well. At any rate, as with any community, the proof is in people stepping up to help out. I like that so many people suggested we keep going. As for what to do, I think the options are pretty wide open and there is opportunity for people to define the project w/o any previous encumbrances. Some ideas that have been kicked around in the past: 1. Creative-commons data set, judgments, queries 2. Open Street Map (spatial search) 3. Mail archives 4. A crowd sourcing application. Given a set of documents and queries, have people provide judgments. Ideally, this runs in a web container and we could probably even find resources to host it here. Combining that with one of the items above, we would be on our way. App could also solicit queries by providing users an open search box and opportunities to browse the data. I know much of this is simplistic, but it is a start. -Grant On Sep 13, 2010, at 9:04 PM, Dan Cardin wrote: Hello, I am new to ORP. I would like to contribute to the project. I do not have a lot of experience in this field of IR, crowd sourcing or AI. If someone could take the lead and set a forward path I would be willing to contribute my skill set to ORP. How can I help?
I have a lot of experience doing software development and system administration. Cheers, --Dan On Mon, Sep 13, 2010 at 1:36 PM, Omar Alonso oralo...@yahoo.com wrote: I think ORP is a great candidate for crowdsourcing/human computation. In the last year or so there's been quite a bit of research and applications on this. See the page for the SIGIR workshop on using crowdsourcing for IR evaluation: http://www.ischool.utexas.edu/~cse2010/ Omar --- On Mon, 9/13/10, Itamar Syn-Hershko ita...@code972.com wrote: From: Itamar Syn-Hershko ita...@code972.com Subject: Re: Whither ORP? To: openrelevance-...@lucene.apache.org Date: Monday, September 13, 2010, 9:33 AM With the proper two-way open-source development process (taking and then giving) I think it can become an important part of open-IR technologies, just like what Lucene did to the search engine world. What ORP has to offer is of great interest to HebMorph, an open-source project of mine trying to decide on the best way to index and search Hebrew texts. To this end I decided to put some of the development efforts of the HebMorph project into making tools for the ORP. I have announced this before, but unfortunately I had to attend to more pressing tasks before I could complete this (and there was no response from the community anyway...). Just in case you're interested in seeing what I came up with so far: http://github.com/synhershko/Orev. IMHO, the ORP should stand by itself, and relate to Lucene/Solr only as its basis framework for these initial stages. Perhaps also try to attract more people who could find an interest in what it has to offer, so it can really start growing. Itamar. On 12/9/2010 1:29 PM, Grant Ingersoll wrote: On Sep 11, 2010, at 8:51 PM, Robert Muir wrote: i propose we take what we have and import into lucene-java's benchmark contrib.
it already has integration with wikipedia and reuters for perf purposes, and the quality package is actually there anyways. later, maybe more people have time and contrib/benchmark evolves naturally... e.g. to modules/benchmark with solr support as a first big step. Yeah, that seems reasonable. I have been thinking lately that it might be useful to pull our DocMaker stuff out separately from benchmark so that people have easy ways of generating content from things like Wikipedia, etc. Still, at the end of the day, I like what ORP _could_ bring to the table and to some extent I think that is lost by folding it into Lucene benchmark. On Sep 11, 2010 7:33 PM, Grant Ingersoll gsing...@apache.org wrote: Seems ORP isn't really catching on with people. I know personally I don't have the time I had hoped to have to get it going. At the same time, I really think it could be a good project. We've got some tools put
Re: Whither ORP?
On Tue, Sep 14, 2010 at 10:22 AM, Dan Cardin dcardin2...@gmail.com wrote: Hello, This is a great start! I am interested in helping with the development of a crowd sourcing application. The next step would be creating a set of requirements for the web app. Would the ORP wiki be a good place to store the requirements? +1, don't hold back! -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909272#action_12909272 ] Yonik Seeley commented on LUCENE-2504: -- Looks like we're not using the correct comparators everywhere. I was trying a slightly different way to implement sort-missing-last, and my first comparator only implements setNextReader(), but I'm now getting many UnsupportedOperationExceptions (i.e. the search process is using older comparators after calling setNextReader()) One culprit is OneComparatorNonScoringCollector, and another is OneComparatorFieldValueHitQueue I think. sorting performance regression -- Key: LUCENE-2504 URL: https://issues.apache.org/jira/browse/LUCENE-2504 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley Fix For: 4.0 Attachments: LUCENE-2504.patch, LUCENE-2504.patch, LUCENE-2504.zip sorting can be much slower on trunk than branch_3x -- This message is automatically generated by JIRA.
Re: Whither ORP?
On Tue, Sep 14, 2010 at 4:30 PM, Robert Muir rcm...@gmail.com wrote: On Tue, Sep 14, 2010 at 10:22 AM, Dan Cardin dcardin2...@gmail.com wrote: Hello, This is a great start! I am interested in helping with the development of a crowd sourcing application. The next step would be creating a set of requirements for the web app. Would the ORP wiki be a good place to store the requirements? +1, don't hold back! +1 - we need some action here! go for it! -- Robert Muir rcm...@gmail.com
[jira] Updated: (LUCENE-2630) make the build more friendly to apache harmony
[ https://issues.apache.org/jira/browse/LUCENE-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2630: Attachment: LUCENE-2630.patch The harmony developers applied the UTF-8 fix (HARMONY-6640), so we don't need to hack MockTokenizer anymore. i've updated patch, 'ant test-core -Dbuild.compiler=extJavac' almost passes. i'll iterate with some more test improvements now that we are going somewhere. make the build more friendly to apache harmony -- Key: LUCENE-2630 URL: https://issues.apache.org/jira/browse/LUCENE-2630 Project: Lucene - Java Issue Type: Task Components: Build, Tests Affects Versions: 4.0 Reporter: Robert Muir Attachments: LUCENE-2630.patch, LUCENE-2630.patch as part of improved testing, i thought it would be a good idea to make the build (ant test) more friendly to working under apache harmony. i'm not suggesting we de-optimize code for sun jvms or anything crazy like that, only use it as a tool. for example: * bugs in tests/code: for example i found a test that expected ArrayIOOBE when really the javadoc contract for the method is just IOOBE... it just happens to pass always on sun jvm because thats the implementation it always throws. * better reproduction of bugs: for example [2 months out of the year|http://en.wikipedia.org/wiki/Unusual_software_bug#Phase_of_the_Moon_bug] it seems TestQueryParser fails with thai locale in a difficult-to-reproduce way. but i *always* get similar failures like this with harmony for this test class. * better stability and portability: we should try (if reasonable) to avoid depending upon internal details. the same kinds of things that fail in harmony might suddenly fail in a future sun jdk. because its such a different impl, it brings out a lot of interesting stuff. at the moment there are currently a lot of failures, I think a lot might be caused by this: http://permalink.gmane.org/gmane.comp.java.harmony.devel/39484 -- This message is automatically generated by JIRA. 
Re: exceptions from solr/contrib/dataimporthandler and solr/contrib/extraction
On Sep 13, 2010, at 1:59 PM, Lance Norskog wrote: What I want you to do is, I want you to find the guys who are putting all the bugs in the code, and I want you to FIRE THEM! He who is without bugs in their code, may be the first to fire. -Grant
[jira] Updated: (LUCENE-2630) make the build more friendly to apache harmony
[ https://issues.apache.org/jira/browse/LUCENE-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2630: Attachment: LUCENE-2630_charutils.patch
[jira] Commented: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909312#action_12909312 ] Michael McCandless commented on LUCENE-2504: bq. I'm now getting many UnsupportedOperationExceptions (i.e. the search process is using older comparators after calling setNextReader()) That's no good! bq. One culprit is OneComparatorNonScoringCollector, and another is OneComparatorFieldValueHitQueue I think. Hmm I don't see the problem -- eg OneComparatorNonScoringCollector saves the returned comparator from comparator.setNextReader. Can you post the full exc?
[jira] Updated: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-2504: - Attachment: LUCENE-2504.patch Attaching a draft patch that seems to fix the issue (the ones I can find, at least). bq. Hmm I don't see the problem - eg OneComparatorNonScoringCollector saves the returned comparator from comparator.setNextReader. Yes, but FieldValueHitQueue has its own list of comparators that never gets updated.
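The contract behind this bug can be illustrated with a toy sketch (illustrative names only, not the real Lucene classes): setNextReader may hand back a replacement comparator, so every structure holding the old reference, the hit queue included, has to be switched to the returned one.

```java
// Toy model of the comparator contract discussed in this issue: the
// per-segment setNextReader call may return a *new* comparator, and every
// holder of the old reference must be updated with the returned instance.
interface SegmentComparator {
    SegmentComparator setNextReader(int segmentBase); // may return a replacement
    int compare(int slotA, int slotB);
}

class PerSegmentComparator implements SegmentComparator {
    private final int segmentBase;
    PerSegmentComparator(int segmentBase) { this.segmentBase = segmentBase; }
    public SegmentComparator setNextReader(int segmentBase) {
        // returns a fresh instance instead of mutating in place
        return new PerSegmentComparator(segmentBase);
    }
    public int compare(int slotA, int slotB) { return Integer.compare(slotA, slotB); }
}

class ToyHitQueue {
    final SegmentComparator[] comparators; // the queue's own copy of the comparators
    ToyHitQueue(SegmentComparator... comparators) { this.comparators = comparators; }
    // The fix amounts to this: when the collector receives a replacement
    // from setNextReader, it must be written back into the queue as well,
    // or the queue keeps ordering hits with a stale comparator.
    void replaceComparator(int i, SegmentComparator c) { comparators[i] = c; }
}
```

In the buggy code path the collector saved the returned comparator for itself but the queue's array kept the original, which is why stale comparators were still being invoked after a segment transition.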
[jira] Commented: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909337#action_12909337 ] Michael McCandless commented on LUCENE-2504: {quote} I think we all owe it to ourselves to stop equating java with Oracle, if Java stays with Oracle its pretty obvious the language (is) will die anyway. {quote} Yeah I agree. The open question is whether this hotspot fickleness is particular to Oracle's java impl, or, is somehow endemic to bytecode VMs (.NET included). It's really a hard, complex problem (JIT compilation from bytecode based on runtime data), so it wouldn't surprise me if it's the latter, to varying degrees. bq. .NET is not a choice but generating C/C++ code is? As far as I know it's much easier to invoke C/C++ from java, than .NET from java. C/C++ is also more portable than .NET, I think? (There is Mono -- how mature is it by now?). {quote} I don't think we should jump the gun and make real design/architectural choices based on Oracle bugs. {quote} I expect source code spec will also buy sizable perf gains irrespective of hotspot fickleness, and in non-Oracle java impls. Generating a dedicated class, with one method doing all searching and collecting, removes all kinds of barriers to the JIT compiler. It makes its job far easier. bq. I agree with robert that we should stop comparing against sun jvms all the time and turn everything upside-down specializing code here and there or go one step further and generate C++ code. Dude who is gonna maintain the compatibility to Java-Only environments? If we manage to pursue specialized code gen, it'll be a long time coming! My point about C/C++ is that if we do somehow manage to get a working code gen framework online (for Java), the added cost to make it also target C/C++ will be relatively small. Ie, it's nearly for free.
If we were to do this, that would not mean we'd abandon java, of course -- the framework would fully support pure java as well. bq. I think that code specializations of very hot part of lucene are ok and we should follow that way like we did at some places but it already make things very complicated to follow. You mean manual specialization, right (like this issue)? Yes, I think we will have to keep manually specializing, going forward, until we have a code generator that does it more cleanly... bq. Would it make way more sense to push OSS JVMs than spending lots of time on investigating on .NET as an alternative or C/C++ code generator? I think we should do both. bq. Before I would go the C++ path I'd rather use Java to host a C core like lucy which brings you as close as it gets to the machine. I think this (a Java wrapper for Lucy) is a great idea -- we should explore that, too. bq. interesting papers - seems we are touching the limits of Java though. Well that's the big question -- limits of Java or limits of Sun/Oracle's impl. It looks like harmony has a ways to go on absolute performance: I just ran a very quick benchmark (TermQuery search on 10 M multi-segment wiki index w/ a 50% random filter) and Oracle java 1.6.0_21 gets 15.6 QPS while Harmony 1.5.0-r946978 gets 9.5 QPS (Harmony 1.6.0-r946981 also gets 9.5 QPS). I just ran java -server -Xms2g -Xmx2g; it's possible that by tuning Harmony (it has many awesome looking command-line args!) it'd get faster...
[jira] Reopened: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-2504: bq. Yes, but FieldValueHitQueue has its own list of comparators that never gets updated. Ugh, yes.
[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations
[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-2575: - Attachment: LUCENE-2575.patch Term frequency is recorded and returned. There are Terms, TermsEnum, DocsEnum implementations. Needs the term vectors, doc stores exposed via the RAM reader, concurrency unit tests, and a payload unit test. Still quite rough. Concurrent byte and int block implementations - Key: LUCENE-2575 URL: https://issues.apache.org/jira/browse/LUCENE-2575 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: Realtime Branch Reporter: Jason Rutherglen Fix For: Realtime Branch Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch The current *BlockPool implementations aren't quite concurrent. We really need something that has a locking flush method, where flush is called at the end of adding a document. Once flushed, the newly written data would be available to all other reading threads (ie, postings etc). I'm not sure I understand the slices concept, it seems like it'd be easier to implement a seekable random access file like API. One'd seek to a given position, then read or write from there. The underlying management of byte arrays could then be hidden? -- This message is automatically generated by JIRA.
Re: Whither ORP?
On 14/9/2010 4:22 PM, Dan Cardin wrote: Hello, This is a great start! I am interested in helping with the development of a crowd sourcing application. The next step would be creating a set of requirements for the web app. Would the ORP wiki be a good place to store the requirements? --Dan Uhm... this is actually what I just said I'm in the middle of doing. But perhaps doing some spec'ing through the Wiki would end in a better product, so why not. Please see http://search-lucene.com/m/pLgxg1HCef11subj=OpenRelevance+Viewer+Orev+ to get an idea of what I did there. Let's branch the discussion from there to get this going in the right direction... As I wrote in the other message, this app can be accessed through http://github.com/synhershko/Orev (.NET / C# / NHibernate), and there's still some to do there. Itamar.
Re: Whither ORP?
On 14/9/2010 3:44 PM, Grant Ingersoll wrote: If you can, putting them up as a patch would be useful. That way, we can show some progress. I will, but first it needs to be workable. It is 80% done, but still not that usable. I expect to be able to work on it again in a month or so. Or someone else could resume from where I stopped (in .NET, or port it to Java). I can share what is missing if anyone is interested. Itamar.
[jira] Commented: (LUCENE-2630) make the build more friendly to apache harmony
[ https://issues.apache.org/jira/browse/LUCENE-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909347#action_12909347 ] Simon Willnauer commented on LUCENE-2630: - bq. Here's the patch for TestCharacterUtils. looks good to me! go commit!
Re: Whither ORP?
Hello, I will begin documenting some basic requirements for a crowd sourcing web app. I will use some of the work done by Itamar as a basis for the requirements. --Dan On Tue, Sep 14, 2010 at 1:18 PM, Itamar Syn-Hershko ita...@code972.comwrote: On 14/9/2010 3:44 PM, Grant Ingersoll wrote: If you can, putting them up as a patch would be useful. That way, we can show some progress. I will, but first it needs to be workable. It is 80% done, but still not that usable. I expect to be able to work on it again in a month or so. Or someone else could resume from where I stopped (in .NET, or port it to Java). I'm can share what is missing if anyone is interested. Itamar.
[jira] Updated: (LUCENE-2630) make the build more friendly to apache harmony
[ https://issues.apache.org/jira/browse/LUCENE-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2630: Attachment: LUCENE-2630_intl.patch here's a patch for the internationalization differences, since harmony uses ICU. * the collator gives a different order for Locale.US, though it's weird that we test the order of non-US characters under the US locale (their order isn't defined there and is inherited from the root locale). I conditionalized this test as such: {code} // the sort order of Ø versus U depends on the version of the rules being used // for the inherited root locale: Ø's order isn't specified in Locale.US since // it's not used in english. private boolean oStrokeFirst = Collator.getInstance(new Locale("")).compare("Ø", "U") < 0; {code} * the thai dictionary-based break iterator gives different results: I used text that both impls segment the same way.
[jira] Commented: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909407#action_12909407 ] Yonik Seeley commented on LUCENE-2504: -- bq. The open question is whether this hotspot fickleness is particular to Oracle's java impl, or, is somehow endemic to bytecode VMs (.NET included). I tried IBM's latest Java6 (SR8 FP1, 20100624). It seems to have some of the same pitfalls as Oracle's JVM, just different. The first run does not differ from the second run in the same JVM as it does with Oracle, but the first run itself has much more variation. The worst case is worse, and just like the Oracle JVM, it gets stuck in its worst case. Each run (of the complete set of fields) was in a separate JVM, since two runs in the same JVM didn't really differ as they did in the Oracle JVM.

branch_3x:
|unique terms in field|median sort time of 100 sorts in ms|another run|another run|another run|another run|another run|another run
|10|129|128|130|109|98|128|135
|1|128|123|127|127|98|128|135
|1000|129|130|130|128|98|130|136
|100|128|133|133|130|100|132|139
|10|150|153|153|154|122|153|159

trunk:
|unique terms in field|median sort time of 100 sorts in ms|another run|another run|another run|another run|another run|another run
|10|217|81|383|99|79|78|215
|1|254|73|346|101|106|108|267
|1000|253|74|347|99|107|108|258
|100|253|107|394|98|107|102|255
|10|251|107|388|99|106|98|257

The second way of testing is to completely mix fields (no serial correlation between what field is sorted on). This is the test that is very predictable with the Oracle JVM, but I still see wide variability with the IBM JVM. Here is the list of different runs for the Oracle JVM (ms):

branch_3x |128|129|123|120|128|100|95|74|130|91|120
trunk |106|89|168|116|155|119|108|118|112|169|165

To my eye, it looks like we have more variability in trunk, due to increased use of abstractions?
[jira] Commented: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909456#action_12909456 ] Yonik Seeley commented on LUCENE-2504: -- OK, I've committed the fix to always use the latest generation field comparator. Not sure if this is the best way to handle it - but at least it's correct now and we can improve more later.
[jira] Created: (SOLR-2120) Facet Field Value truncation
Facet Field Value truncation Key: SOLR-2120 URL: https://issues.apache.org/jira/browse/SOLR-2120 Project: Solr Issue Type: Bug Components: search Affects Versions: 1.4.1 Reporter: Niall O'Connor There is a limit of 256 characters on the length of indexed string values; this results in undesirable behavior for facet field values, for example:

<lst name="facet_fields">
  <lst name="pub_articletitle">
    <int name="1">2302</int>
    <int name="hiv">1403</int>
    <int name="type">1382</int>
  </lst>
  <lst name="tissue-antology">
    <int name="Lymph node,Organ component,Cardinal organ part,Anatomical structure,Material anatomical entity,Physical anatomical entity,Anatomical entity">419</int>
    <int name="Left frontal lobe,Frontal lobe,Lobe of cerebral hemisphere,Segment of cerebral hemisphere,Segment of telencephalon,Segment of forebrain,Segment of brain,Segment of neuraxis,Organ segment,Organ region,Cardinal organ part,Anatomical structure,Material anatom">236</int>
    <int name="ical entity,Physical anatomical entity,Anatomical entity">236</int>
  </lst>
</lst>

The last facet in the list is being truncated and spills into a new facet. This also eats up a facet slot, since I usually only return the top 3. Is 256 characters a hard limit in the indexing strategy? -- This message is automatically generated by JIRA.
[jira] Created: (LUCENE-2644) LowerCaseTokenizer Does Not Behave As One Might Expect (or Desire)--Given Its Name
LowerCaseTokenizer Does Not Behave As One Might Expect (or Desire)--Given Its Name -- Key: LUCENE-2644 URL: https://issues.apache.org/jira/browse/LUCENE-2644 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.0.2 Reporter: Scott Gonyea Fix For: 3.0.3, 3.1, Realtime Branch, 4.0 While I understand some of the reasons for its design, the original LowerCaseTokenizer should have been named LowerCaseLetterTokenizer. I feel that LowerCaseTokenizer makes too many assumptions about what to tokenize, and I have therefore patched it. The *default* behavior will remain as it always has--to avoid breaking any implementations for which it's being used. I have changed LowerCaseTokenizer to extend CharTokenizer (rather than LetterTokenizer). LetterTokenizer's functionality was merged into the default behavior of LowerCaseTokenizer. Getter/setter methods have been added to the LowerCaseTokenizer class, allowing you to turn on/off tokenizing by whitespace, numbers, and special (non-alphanumeric) characters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2644) LowerCaseTokenizer Does Not Behave As One Might Expect (or Desire)--Given Its Name
[ https://issues.apache.org/jira/browse/LUCENE-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Scott Gonyea updated LUCENE-2644: - Attachment: LowerCaseTokenizer.patch This patch retains the original functionality, while permitting the user to modify the assumptions on which tokens are built. LowerCaseTokenizer Does Not Behave As One Might Expect (or Desire)--Given Its Name -- Key: LUCENE-2644 URL: https://issues.apache.org/jira/browse/LUCENE-2644 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.0.2 Reporter: Scott Gonyea Fix For: 3.0.3, 3.1, Realtime Branch, 4.0 Attachments: LowerCaseTokenizer.patch While I understand some of the reasons for its design, the original LowerCaseTokenizer should have been named LowerCaseLetterTokenizer. I feel that LowerCaseTokenizer makes too many assumptions about what to tokenize, and I have therefore patched it. The *default* behavior will remain as it always has--to avoid breaking any implementations for which it's being used. I have changed LowerCaseTokenizer to extend CharTokenizer (rather than LetterTokenizer). LetterTokenizer's functionality was merged into the default behavior of LowerCaseTokenizer. Getter/setter methods have been added to the LowerCaseTokenizer class, allowing you to turn on/off tokenizing by whitespace, numbers, and special (non-alphanumeric) characters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
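The configurable behavior the patch describes can be sketched without the Lucene classes. This is a self-contained approximation (class and method names are illustrative, not the patched Lucene API): token characters are lowercased as they are consumed, and the character classes that count as token characters can be toggled.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch (not the Lucene classes) of the patch's idea:
// lower-case tokenization where the character classes that delimit
// tokens can be toggled, instead of always splitting on non-letters.
public class ConfigurableLowerCaseTokenizer {
    private final boolean keepDigits;  // treat digits as token characters?

    public ConfigurableLowerCaseTokenizer(boolean keepDigits) {
        this.keepDigits = keepDigits;
    }

    // analogous to CharTokenizer's isTokenChar() extension point
    private boolean isTokenChar(char c) {
        return Character.isLetter(c) || (keepDigits && Character.isDigit(c));
    }

    public List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c)); // normalize while tokenizing
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

With the default (letters only), `tokenize("Foo2Bar")` yields `[foo, bar]`, matching LowerCaseTokenizer's historical behavior; with digits enabled it yields `[foo2bar]`.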
[jira] Updated: (LUCENE-2504) sorting performance regression
[ https://issues.apache.org/jira/browse/LUCENE-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-2504: - Attachment: LUCENE-2504_SortMissingLast.patch This was a simple attempt to simplify the comparators. Static classes are used instead of inner classes. Unfortunately, it didn't keep the JVMs from getting stuck in badly optimized code (it was a long shot for that), but it does result in a consistent 4% speedup. It looks as simple as the previous version to my eye, so I'll commit if there are no objections. sorting performance regression -- Key: LUCENE-2504 URL: https://issues.apache.org/jira/browse/LUCENE-2504 Project: Lucene - Java Issue Type: Bug Affects Versions: 4.0 Reporter: Yonik Seeley Fix For: 4.0 Attachments: LUCENE-2504.patch, LUCENE-2504.patch, LUCENE-2504.patch, LUCENE-2504.zip, LUCENE-2504_SortMissingLast.patch sorting can be much slower on trunk than branch_3x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations
[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-2575: - Attachment: LUCENE-2575.patch Added a unit test for payloads, term vectors, and doc stores. The reader flushes term vectors and doc stores on demand, once per reader. Also, little things are getting cleaned up in the realtime branch. Concurrent byte and int block implementations - Key: LUCENE-2575 URL: https://issues.apache.org/jira/browse/LUCENE-2575 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: Realtime Branch Reporter: Jason Rutherglen Fix For: Realtime Branch Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch The current *BlockPool implementations aren't quite concurrent. We really need something that has a locking flush method, where flush is called at the end of adding a document. Once flushed, the newly written data would be available to all other reading threads (ie, postings etc). I'm not sure I understand the slices concept, it seems like it'd be easier to implement a seekable random access file like API. One'd seek to a given position, then read or write from there. The underlying management of byte arrays could then be hidden? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (SOLR-1194) Query Analyzer not Invoking for Custom FieldType - When we use Custom QParser Plugin
[ https://issues.apache.org/jira/browse/SOLR-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man resolved SOLR-1194. Resolution: Invalid This sounds like a bug in your custom QParser -- the QParser is what calls the analyzer and constructs the query. Without any information as to how FPersonQParserPlugin is implemented, there doesn't seem to be a bug here. If your issue is that you have questions about how to implement FPersonQParserPlugin properly so that it uses the field's analyzer, please post that as a question to the solr-user mailing list. Query Analyzer not Invoking for Custom FieldType - When we use Custom QParser Plugin Key: SOLR-1194 URL: https://issues.apache.org/jira/browse/SOLR-1194 Project: Solr Issue Type: Bug Components: search Affects Versions: 1.3 Environment: Windows, Java 1.6, Solr 1.3 Reporter: Nagarajan.shanmugam Original Estimate: 2h Remaining Estimate: 2h Hi, I created a custom Solr field kwd_names in schema.xml:

<fieldType name="kwd_names" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Metaphone" inject="true"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="Metaphone" inject="true"/>
  </analyzer>
</fieldType>

I configured a requestHandler in solrconfig.xml with a custom QParserPlugin:

<requestHandler name="fperson" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">fpersonQueryParser</str>
  </lst>
</requestHandler>

<queryParser name="fpersonQueryParser" class="com.thinkronize.edudym.search.analysis.FPersonQParserPlugin"/>

SolrQuery q = new SolrQuery();
q.setParam("q", "George");
q.setParam("gender", "M");
q.setQueryType(FPersonSearcher.QUERY_TYPE);
server.query(q);

When I fire the query it won't invoke the query analyzer, and it doesn't give any results. But if I remove q.setQueryType, it invokes the query analyzer and gives results. That means the query analyzer for that field is not invoked when I use the custom QParser plugin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
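The symptom reported here (no results when the query analyzer is skipped) can be sketched without Solr. This is a self-contained illustration of the underlying mismatch, with illustrative names only, not Solr APIs: terms are normalized at index time, so a raw query term that bypasses the same analysis chain cannot match.

```java
import java.util.HashSet;
import java.util.Set;

// Self-contained sketch of why skipping the query analyzer yields no hits:
// terms are normalized at index time (here: trim + lowercase, like the
// schema's chain), so an un-normalized query term cannot match them.
public class AnalyzerMismatch {
    static String analyze(String raw) {
        return raw.trim().toLowerCase();
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>();
        index.add(analyze("  George "));  // indexed as "george"

        String rawQuery = "George";
        System.out.println(index.contains(rawQuery));           // query analyzer skipped
        System.out.println(index.contains(analyze(rawQuery)));  // query analyzer applied
    }
}
```

The first lookup fails and the second succeeds, which is exactly the difference between the custom QParser ignoring the field's query analyzer and invoking it.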
[jira] Commented: (SOLR-2119) IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order
[ https://issues.apache.org/jira/browse/SOLR-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909511#action_12909511 ] Robert Muir commented on SOLR-2119: --- {quote} There seems to be a segment of the user population that has a hard time understanding the distinction between a charfilter, a tokenizer, and a tokenfilter - while we can certainly try to improve the documentation about what exactly each does, and when they take effect in the analysis chain, one other thing we should do is try to educate people when they construct their analyzer in a way that doesn't make any sense. {quote} I think we should do both; this is a great idea. {quote} (we could easily make such a situation fail to initialize, but i'm not convinced that would be the best course of action, since some people may have schemas where they have declared a charFilter or tokenizer out of order relative to their tokenFilters, but are still getting correct results that work for them, and breaking their instance on upgrade doesn't seem like it would be productive) {quote} I would prefer a hard error. I think someone who doesn't understand what tokenizers and filters do likely isn't looking at their log files either. In my opinion, Solr should be more picky about its configuration. Oftentimes, if I haven't had enough sleep, I will type tokenFilter instead of filter, and Solr simply ignores it completely instead of raising an error. And I can't be the only one who does this; it's not obvious that tokenizer = Tokenizer, charFilter = CharFilter, analyzer = Analyzer, but filter = TokenFilter.
IndexSchema should log warning if analyzer is declared with charfilter/tokenizer/tokenfilter out of order -- Key: SOLR-2119 URL: https://issues.apache.org/jira/browse/SOLR-2119 Project: Solr Issue Type: Improvement Components: Schema and Analysis Reporter: Hoss Man There seems to be a segment of the user population that has a hard time understanding the distinction between a charfilter, a tokenizer, and a tokenfilter -- while we can certainly try to improve the documentation about what exactly each does, and when they take effect in the analysis chain, one other thing we should do is try to educate people when they construct their analyzer in a way that doesn't make any sense. At the moment, some people are attempting to do things like move the Foo <tokenFilter/> before the <tokenizer/> to try and get certain behavior ... at a minimum we should log a warning in this case that doing so doesn't have the desired effect (we could easily make such a situation fail to initialize, but i'm not convinced that would be the best course of action, since some people may have schemas where they have declared a charFilter or tokenizer out of order relative to their tokenFilters, but are still getting correct results that work for them, and breaking their instance on upgrade doesn't seem like it would be productive) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
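For reference, a hypothetical schema fragment showing the ordering the issue wants declarations to follow (the specific factory classes here are just illustrative choices):

```xml
<!-- Hypothetical example: the components of an <analyzer> are applied as
     charFilter(s), then the tokenizer, then tokenFilter(s), regardless of
     the order they are declared in schema.xml. Declaring them in that same
     order avoids the confusion described above. -->
<fieldType name="text_example" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```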
[jira] Created: (SOLR-2121) distributed highlighting using q.alt=*:* causes NPE in finishStages
distributed highlighting using q.alt=*:* causes NPE in finishStages --- Key: SOLR-2121 URL: https://issues.apache.org/jira/browse/SOLR-2121 Project: Solr Issue Type: Bug Reporter: Hoss Man As noted on the mailing list by Ron Mayer, using the example configs and example data on trunk, this query works... http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax ...but this query causes a NullPointerException... http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax&shards=localhost:8983/solr Stack Trace...

{noformat}
java.lang.NullPointerException
	at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:158)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:310)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1324)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
{noformat}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2121) distributed highlighting using q.alt=*:* causes NPE in finishStages
[ https://issues.apache.org/jira/browse/SOLR-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909514#action_12909514 ] Hoss Man commented on SOLR-2121: Marc Sturlese posted his fix, but it's not entirely obvious to me what exactly the necessary change is, or if the root cause isn't somewhere else...

{code}
public void finishStage(ResponseBuilder rb) {
  boolean hasHighlighting = true;
  if (rb.doHighlights && rb.stage == ResponseBuilder.STAGE_GET_FIELDS) {
    Map.Entry<String, Object>[] arr = new NamedList.NamedListEntry[rb.resultIds.size()];
    // TODO: make a generic routine to do automatic merging of id keyed data
    for (ShardRequest sreq : rb.finished) {
      if ((sreq.purpose & ShardRequest.PURPOSE_GET_HIGHLIGHTS) == 0) continue;
      for (ShardResponse srsp : sreq.responses) {
        NamedList hl = (NamedList) srsp.getSolrResponse().getResponse().get("highlighting");
        // patch bug
        if (hl != null) {
          for (int i = 0; i < hl.size(); i++) {
            String id = hl.getName(i);
            ShardDoc sdoc = rb.resultIds.get(id);
            int idx = sdoc.positionInResponse;
            arr[idx] = new NamedList.NamedListEntry(id, hl.getVal(i));
          }
        } else {
          hasHighlighting = false;
        }
      }
    }
    // remove nulls in case not all docs were able to be retrieved
    // patch bug
    if (hasHighlighting) {
      rb.rsp.add("highlighting", removeNulls(new SimpleOrderedMap(arr)));
    }
  }
}
{code}

distributed highlighting using q.alt=*:* causes NPE in finishStages --- Key: SOLR-2121 URL: https://issues.apache.org/jira/browse/SOLR-2121 Project: Solr Issue Type: Bug Reporter: Hoss Man As noted on the mailing list by Ron Mayer, using the example configs and example data on trunk, this query works... http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax ...but this query causes a NullPointerException... http://localhost:8983/solr/select?q.alt=*:*&hl=on&defType=edismax&shards=localhost:8983/solr Stack Trace...

{noformat}
java.lang.NullPointerException
	at org.apache.solr.handler.component.HighlightComponent.finishStage(HighlightComponent.java:158)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:310)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1324)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
{noformat}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Current trunk example woes...
If I check out the current trunk, and from solr do an ant clean example all is well, even up to starting Solr. But trying to hit anything on the site gives a response in the browser starting with: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType:Error loading class 'solr.SpatialTileField' Commenting the relevant fieldType out of schema.xml fixes this. Should I open a Jira or does someone want to jump on it? Erick
Obsolete instructions for Velocity ResponseWriter on the Wiki
For trunk, the instructions here: http://wiki.apache.org/solr/VelocityResponseWriter about starting up VRW/Solaritas are obsolete I think. It looks like all this has been folded into core. I'll go up and add some notes for trunk/1.5 unless someone objects. Erick
Build failed in Hudson: Lucene-3.x #115
See https://hudson.apache.org/hudson/job/Lucene-3.x/115/changes Changes: [rmuir] LUCENE-2630: look for the correct exception according to javadoc contract [gsingers] SOLR-1568: move DistanceUtils up a package [gsingers] SOLR-1568: backport to 3.x [rmuir] LUCENE-2630: allow lucene to be built with non-sun jvms [rmuir] missing merge props for r996720 [rmuir] quiet this test [rmuir] LUCENE-2642: merge Uwe's test improvements [rmuir] LUCENE-2642: merge LuceneTestCase and LuceneTestCaseJ4 [rmuir] add exception ignore for extraction test [rmuir] SOLR-2118: fix setTermIndexDivisor param to have its correct name -- [...truncated 18329 lines...] [junit] Testsuite: org.apache.lucene.search.TestTimeLimitingCollector [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 2.863 sec [junit] [junit] Testsuite: org.apache.lucene.search.TestTopDocsCollector [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.018 sec [junit] [junit] Testsuite: org.apache.lucene.search.TestTopScoreDocCollector [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.005 sec [junit] [junit] Testsuite: org.apache.lucene.search.TestWildcard [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.036 sec [junit] [junit] Testsuite: org.apache.lucene.search.function.TestCustomScoreQuery [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 8.251 sec [junit] [junit] Testsuite: org.apache.lucene.search.function.TestDocValues [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.005 sec [junit] [junit] Testsuite: org.apache.lucene.search.function.TestFieldScoreQuery [junit] Tests run: 12, Failures: 0, Errors: 0, Time elapsed: 0.249 sec [junit] [junit] Testsuite: org.apache.lucene.search.function.TestOrdValues [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.102 sec [junit] [junit] Testsuite: org.apache.lucene.search.payloads.TestPayloadNearQuery [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.65 sec [junit] [junit] Testsuite: 
org.apache.lucene.search.payloads.TestPayloadTermQuery [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.033 sec [junit] [junit] Testsuite: org.apache.lucene.search.spans.TestBasics [junit] Tests run: 20, Failures: 0, Errors: 0, Time elapsed: 14.143 sec [junit] [junit] Testsuite: org.apache.lucene.search.spans.TestFieldMaskingSpanQuery [junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 0.663 sec [junit] [junit] Testsuite: org.apache.lucene.search.spans.TestNearSpansOrdered [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 0.093 sec [junit] [junit] Testsuite: org.apache.lucene.search.spans.TestPayloadSpans [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 1.56 sec [junit] [junit] - Standard Output --- [junit] [junit] Spans Dump -- [junit] payloads for span:2 [junit] doc:0 s:3 e:6 one:Entity:3 [junit] doc:0 s:3 e:6 three:Noise:5 [junit] [junit] Spans Dump -- [junit] payloads for span:3 [junit] doc:0 s:0 e:3 rr:Noise:1 [junit] doc:0 s:0 e:3 yy:Noise:2 [junit] doc:0 s:0 e:3 xx:Entity:0 [junit] [junit] Spans Dump -- [junit] payloads for span:3 [junit] doc:1 s:0 e:4 yy:Noise:1 [junit] doc:1 s:0 e:4 rr:Noise:3 [junit] doc:1 s:0 e:4 xx:Entity:0 [junit] [junit] Spans Dump -- [junit] payloads for span:3 [junit] doc:0 s:0 e:3 rr:Noise:1 [junit] doc:0 s:0 e:3 yy:Noise:2 [junit] doc:0 s:0 e:3 xx:Entity:0 [junit] [junit] Spans Dump -- [junit] payloads for span:3 [junit] doc:0 s:0 e:3 yy:Noise:2 [junit] doc:0 s:0 e:3 xx:Entity:0 [junit] doc:0 s:0 e:3 rr:Noise:1 [junit] [junit] Spans Dump -- [junit] payloads for span:3 [junit] doc:1 s:0 e:4 rr:Noise:3 [junit] doc:1 s:0 e:4 xx:Entity:0 [junit] doc:1 s:0 e:4 yy:Noise:1 [junit] [junit] Spans Dump -- [junit] payloads for span:3 [junit] doc:2 s:0 e:5 ss:Noise:2 [junit] doc:2 s:0 e:5 qq:Noise:1 [junit] doc:2 s:0 e:5 pp:Noise:3 [junit] [junit] Spans Dump -- [junit] payloads for span:8 [junit] doc:3 s:0 e:11 ten:Noise:9 [junit] doc:3 s:0 e:11 two:Noise:1 [junit] doc:3 s:0 e:11 six:Noise:5 
[junit] doc:3 s:0 e:11 eleven:Noise:10 [junit] doc:3 s:0 e:11 five:Noise:4 [junit] doc:3 s:0 e:11 one:Entity:0 [junit] doc:3 s:0 e:11 three:Noise:2 [junit] doc:3 s:0 e:11 nine:Noise:8 [junit] [junit] Spans Dump -- [junit] payloads for span:8 [junit] doc:4 s:0 e:11 nine:Noise:0 [junit] doc:4 s:0 e:11 five:Noise:5 [junit] doc:4 s:0 e:11 eleven:Noise:9 [junit] doc:4 s:0 e:11 two:Noise:2 [junit] doc:4 s:0 e:11 one:Entity:1 [junit] doc:4 s:0 e:11 six:Noise:6 [junit] doc:4 s:0 e:11 ten:Noise:10
/trunk sortMissingLast=true status?
Testing with r997128: I have a field defined as:

<fieldType name="bytes" class="solr.TrieLongField" sortMissingLast="true" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>

When I call ?sort=bytes desc, everything works as expected: the biggest things are first. When I call ?sort=bytes asc, the entries without a bytes field all go first. I am sort of following the changes in LUCENE-2504, which point to oddities with sortMissingLast, but Yonik's comments in #997095 suggest this should be working. Am I missing something? Thanks Ryan - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Current trunk example woes...
On Tue, Sep 14, 2010 at 8:16 PM, Erick Erickson erickerick...@gmail.com wrote: If I check out the current trunk, and from solr do an ant clean example all is well, even up to starting Solr. But trying to hit anything on the site gives a response in the browser starting with: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType:Error loading class 'solr.SpatialTileField' Commenting the relevant fieldType out of schema.xml fixes this. Should I open a Jira or does someone want to jump on it? Hmmm, I can't reproduce this. Something like http://localhost:8983/solr/select?q=solr seems to work fine. Did you do an svn up at the trunk level (i.e. get lucene too)? -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: /trunk sortMissingLast=true status?
On Tue, Sep 14, 2010 at 9:40 PM, Ryan McKinley ryan...@gmail.com wrote: Testing with r997128: I have a field defined as: <fieldType name="bytes" class="solr.TrieLongField" sortMissingLast="true" precisionStep="0" omitNorms="true" positionIncrementGap="0"/> sortMissingLast/sortMissingFirst is currently only supported on fields that internally use a StringIndex (now DocTermsIndex), because that's the only FieldCache representation that records which fields are missing (via an ord of 0). There's a note in the example schema.xml:

<!-- The optional sortMissingLast and sortMissingFirst attributes are
     currently supported on types that are sorted internally as strings.
     This includes string, boolean, sint, slong, sfloat, sdouble, pdate -->

That's actually the only reason the sint type fields are still around. If we could distinguish between 0 and missing, we could deprecate/remove the s* fields and always use trie fields. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
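The distinction Yonik describes can be sketched in plain Java: with a representation that has an explicit "missing" state (here modeled as null, analogous to ord 0 in a string index), missing values can be ordered last explicitly; a primitive long[] cache has no such state, so a missing field is indistinguishable from 0. This is an illustration of the concept, not Lucene's comparator code.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of sortMissingLast: null models "document has no value for the
// field" (like ord 0 in a DocTermsIndex); a primitive long[] cannot
// represent that state, which is why trie fields can't support it yet.
public class SortMissingLast {
    public static void main(String[] args) {
        Long[] bytes = { 5L, null, 42L, null, 7L };
        Arrays.sort(bytes, Comparator.nullsLast(Comparator.naturalOrder()));
        System.out.println(Arrays.toString(bytes)); // missing entries sort last
    }
}
```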
[jira] Created: (LUCENE-2645) False assertion of 0 position delta in StandardPostingsWriterImpl
False assertion of 0 position delta in StandardPostingsWriterImpl -- Key: LUCENE-2645 URL: https://issues.apache.org/jira/browse/LUCENE-2645 Project: Lucene - Java Issue Type: Bug Components: Index Reporter: David Smiley Priority: Minor StandardPostingsWriterImpl line 159 is:

{code:java}
assert delta > 0 || position == 0 || position == -1 : "position=" + position + " lastPosition=" + lastPosition; // not quite right (if pos=0 is repeated twice we don't catch it)
{code}

I enable assertions when I run my unit tests, and I've found this assertion to fail when delta is 0, which occurs when the same position value is sent in twice in a row. Once I added RemoveDuplicatesTokenFilter, this problem went away. Should I really be forced to add this filter? I think delta >= 0 would be a better assertion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
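To make the failure mode concrete, here is a self-contained sketch (not Lucene code) of how position deltas are computed: a token repeated at the same position, for example a synonym stacked with a position increment of 0, produces a delta of 0, which trips `assert delta > 0` even though the token stream is legal.

```java
import java.util.Arrays;

// Illustration (not Lucene code): deltas between successive token positions.
// A repeated position yields delta == 0, which "assert delta > 0" rejects.
public class PositionDeltas {
    static int[] deltas(int[] positions) {
        int[] d = new int[positions.length];
        int last = 0;
        for (int i = 0; i < positions.length; i++) {
            d[i] = positions[i] - last;  // delta against the previous position
            last = positions[i];
        }
        return d;
    }

    public static void main(String[] args) {
        // position 3 occurs twice, e.g. a term and its stacked synonym
        int[] d = deltas(new int[] { 1, 3, 3, 5 });
        System.out.println(Arrays.toString(d)); // contains a 0 delta
    }
}
```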
[jira] Commented: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909579#action_12909579 ] Steven Rowe commented on LUCENE-2611: - Once Robert's latest patch on SOLR-2002 gets applied -- it moves around some of the Solr module structure -- the IntelliJ setup patches will need to be adjusted. IntelliJ IDEA setup --- Key: LUCENE-2611 URL: https://issues.apache.org/jira/browse/LUCENE-2611 Project: Lucene - Java Issue Type: New Feature Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test_2.patch Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. The attached patch adds a new top level directory {{dev-tools/}} with sub-dir {{idea/}} containing basic setup files for trunk, as well as a top-level ant target named idea that copies these files into the proper locations. This arrangement avoids the messiness attendant to in-place project configuration files directly checked into source control. The IDEA configuration includes modules for Lucene and Solr, each Lucene and Solr contrib, and each analysis module. A JUnit test run per module is included. Once {{ant idea}} has been run, the only configuration that must be performed manually is configuring the project-level JDK. If this patch is committed, Subversion svn:ignore properties should be added/modified to ignore the destination module files (*.iml) in each module's directory. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations
[ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909580#action_12909580 ] Jason Rutherglen commented on LUCENE-2575: -- For the posting skip list we need to implement seek on the ByteSliceReader. However if we're rewriting a portion of a slice, then I guess we could have a problem... Meaning we'd be storing an absolute position in the skip list, and we could go to look up the value, however that byte(s) could have been altered to not be delta encoded doc ids anymore, but instead is/are the forwarding address to the next slice. Do we need an intelligent mechanism that interacts with the byte slice writer to not point at byte array elements (ie the end of slices) that could later be converted into forwarding addresses? Concurrent byte and int block implementations - Key: LUCENE-2575 URL: https://issues.apache.org/jira/browse/LUCENE-2575 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: Realtime Branch Reporter: Jason Rutherglen Fix For: Realtime Branch Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch The current *BlockPool implementations aren't quite concurrent. We really need something that has a locking flush method, where flush is called at the end of adding a document. Once flushed, the newly written data would be available to all other reading threads (ie, postings etc). I'm not sure I understand the slices concept, it seems like it'd be easier to implement a seekable random access file like API. One'd seek to a given position, then read or write from there. The underlying management of byte arrays could then be hidden? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Build failed in Hudson: Solr-3.x #104
See https://hudson.apache.org/hudson/job/Solr-3.x/104/changes Changes: [rmuir] LUCENE-2630: fix intl test bugs that rely on cldr version [rmuir] LUCENE-2630: look for the correct exception according to javadoc contract [gsingers] SOLR-1568: move DistanceUtils up a package [gsingers] SOLR-1568: backport to 3.x [rmuir] LUCENE-2630: allow lucene to be built with non-sun jvms [rmuir] missing merge props for r996720 -- [...truncated 5476 lines...] clover: common.compile-core: compile-core: compile-test: [javac] Compiling 1 source file to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/classes/test init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/misc/classes/java [javac] Compiling 11 source files to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/misc/classes/java [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. compile: [echo] Building queries... Trying to override old definition of task m2-deploy Trying to override old definition of task invoke-javadoc common.init: build-lucene: Trying to override old definition of task contrib-crawl jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile-test: [javac] Compiling 1 source file to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/classes/test init: clover.setup: clover.info: clover: compile-core: compile: [echo] Building spatial... 
Trying to override old definition of task m2-deploy Trying to override old definition of task invoke-javadoc build-queries: common.init: build-lucene: Trying to override old definition of task contrib-crawl jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile-test: [javac] Compiling 1 source file to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/classes/test init: clover.setup: clover.info: clover: common.compile-core: [mkdir] Created dir: https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/spatial/classes/java [javac] Compiling 29 source files to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/spatial/classes/java [javac] Note: https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/contrib/spatial/src/java/org/apache/lucene/spatial/tier/CartesianPolyFilterBuilder.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. compile-core: compile: [echo] Building spellchecker... 
Trying to override old definition of task m2-deploy Trying to override old definition of task invoke-javadoc common.init: build-lucene: Trying to override old definition of task contrib-crawl jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile-test: [javac] Compiling 1 source file to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/classes/test init: clover.setup: clover.info: clover: compile-core: [mkdir] Created dir: https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/spellchecker/classes/java [javac] Compiling 12 source files to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/spellchecker/classes/java [javac] Note: https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. compile: [echo] Building xml-query-parser... Trying to override old definition of task m2-deploy Trying to override old definition of task invoke-javadoc build-queries: common.init: build-lucene: Trying to override old definition of task contrib-crawl jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: init: clover.setup: clover.info: clover: common.compile-core: compile-core: compile-test: [javac] Compiling 1 source file to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/classes/test init: clover.setup: clover.info: clover: common.compile-core: [mkdir] Created dir: https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/xml-query-parser/classes/java [javac] Compiling 36 source files to https://hudson.apache.org/hudson/job/Solr-3.x/ws/branch_3x/lucene/build/contrib/xml-query-parser/classes/java [javac] Note: