[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552497#comment-13552497
 ] 

Dawid Weiss commented on LUCENE-4682:
-

Yeah, there are many ideas layered on top of each other and it's gotten beyond 
the point of being easy to comprehend.

As for the next bit -- in any implementation I've seen this leads to a 
significant reduction in automaton size. But I'm not saying it's the optimal 
way to do it, perhaps there are other encoding options that would reach similar 
compression levels without the added complexity.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (pack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...
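
To make the padding concrete, here is a minimal plain-Java sketch (the arc byte sizes are invented for illustration; this is not Lucene's actual FST writing code):

```java
// Sketch of the waste: with fixed-width ("array") arc encoding, every arc
// slot is padded to the widest arc, max(numBytesPerArc). The arc sizes
// below are made up; they are not Lucene's actual numbers.
public class ArcWaste {
    static int wastedBytes(int[] bytesPerArc) {
        int max = 0, total = 0;
        for (int b : bytesPerArc) {
            max = Math.max(max, b);
            total += b;
        }
        // array encoding spends max bytes per slot; the rest is padding
        return max * bytesPerArc.length - total;
    }

    public static void main(String[] args) {
        // one outlier arc (11 bytes, e.g. a big output) among small arcs
        int[] bytesPerArc = {2, 2, 3, 2, 11, 2, 3, 2};
        System.out.println(wastedBytes(bytesPerArc)); // 11*8 - 27 = 61
    }
}
```

A single outlier arc inflates every other slot, which is exactly the ~18% overhead measured on the Kuromoji FST above.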

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB

2013-01-14 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552498#comment-13552498
 ] 

Dawid Weiss commented on LUCENE-3298:
-

The impact will show on 32-bit systems, I'm pretty sure of that. We don't care 
about hardware archaeology, do we? :)
+1.

 FST has hard limit max size of 2.1 GB
 -

 Key: LUCENE-3298
 URL: https://issues.apache.org/jira/browse/LUCENE-3298
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, 
 LUCENE-3298.patch


 The FST uses a single contiguous byte[] under the hood, which in java is 
 indexed by int so we cannot grow this over Integer.MAX_VALUE.  It also 
 internally encodes references to this array as vInt.
 We could switch this to a paged byte[] and make the max size far larger.
 But I think this is low priority... I'm not going to work on it any time soon.
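
A paged byte[] along the lines suggested above can be sketched as follows (page size, names and layout are illustrative assumptions, not an actual patch design):

```java
// Sketch of long-addressable paged bytes: a long offset is split into a
// page index (high bits) and an offset within that page (low bits), so
// total size is no longer capped by a single array's int indexing.
public class PagedBytesSketch {
    static final int PAGE_BITS = 15;              // 32 KB pages (illustrative)
    static final int PAGE_SIZE = 1 << PAGE_BITS;
    static final int PAGE_MASK = PAGE_SIZE - 1;

    final byte[][] pages;

    PagedBytesSketch(long numBytes) {
        int numPages = (int) ((numBytes + PAGE_SIZE - 1) >>> PAGE_BITS);
        pages = new byte[numPages][PAGE_SIZE];
    }

    void set(long offset, byte b) {
        pages[(int) (offset >>> PAGE_BITS)][(int) (offset & PAGE_MASK)] = b;
    }

    byte get(long offset) {
        return pages[(int) (offset >>> PAGE_BITS)][(int) (offset & PAGE_MASK)];
    }

    public static void main(String[] args) {
        PagedBytesSketch bytes = new PagedBytesSketch(100_000L);
        bytes.set(70_000L, (byte) 42); // lands in the third page
        System.out.println(bytes.get(70_000L)); // 42
    }
}
```

The vInt-encoded references mentioned above would also need widening to vLong for this to help.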

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment

2013-01-14 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-4683:
---

Attachment: LUCENE-4683.patch

* Added setNextReader to CategoryListIterator (instead of init()) and 
Aggregator.

* Modified StandardFacetsAccumulator to iterate over the segments' atomic readers 
and call setNextReader accordingly.

* Fixed an issue in ScoredDocIdsUtils where it assumed ScoredDocIDs are 
OpenBitSet, whereas for a long time they have been FixedBitSet. This caused an 
unnecessary copy from FixedBitSet to OpenBitSet.

* Most of the other changes are API changes, i.e. createCategoryListIterator no 
longer takes an IndexReader etc.

I didn't add a CHANGES entry yet because I'm not sure whether this will make it 
into 4.1. It's basically ready to go in (all tests pass), so I'll check the 
status of the 4.1 branch later today and decide accordingly.

This now makes the cutover to DocValues even easier. That's what I'd like to do 
next.

 Change Aggregator and CategoryListIterator to be per-segment
 

 Key: LUCENE-4683
 URL: https://issues.apache.org/jira/browse/LUCENE-4683
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4683.patch


 As another improvement, these two (mostly CategoryListIterator) should be 
 per-segment. I've got a patch nearly ready, will post tomorrow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info

2013-01-14 Thread Shahar Davidson (JIRA)
Shahar Davidson created SOLR-4302:
-

 Summary: Improve CoreAdmin STATUS request response time by 
allowing user to omit the Index info
 Key: SOLR-4302
 URL: https://issues.apache.org/jira/browse/SOLR-4302
 Project: Solr
  Issue Type: Improvement
  Components: multicore
Affects Versions: 4.0, 4.1, 5.0
Reporter: Shahar Davidson
Priority: Minor


In large multicore environments (hundreds+ of cores), the STATUS request may 
take a fair amount of time.
It seems that the majority of time is spent retrieving the index related info.

The suggested improvement allows the user to pass a parameter (indexInfo); when 
it is 'false', index-related info (such as segmentCount, sizeInBytes, numDocs, 
etc.) is not retrieved. By default indexInfo is 'true', to maintain the 
existing STATUS request behavior.

For example, when tested on a given machine with 380+ solr cores, the full 
STATUS request took 800ms-900ms, whereas using indexInfo=false returned results 
in about 1ms-4ms.
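
Assuming the proposed indexInfo parameter, a client request could be built like this (the host, port and helper method are hypothetical, used only to show where the flag goes):

```java
// Builds CoreAdmin STATUS URLs; indexInfo=false is the flag proposed in
// this issue to skip gathering segmentCount/sizeInBytes/numDocs per core.
// The base URL below is a made-up local example.
public class CoreStatusUrl {
    static String statusUrl(String base, boolean indexInfo) {
        return base + "/admin/cores?action=STATUS&indexInfo=" + indexInfo;
    }

    public static void main(String[] args) {
        System.out.println(statusUrl("http://localhost:8983/solr", false));
        // http://localhost:8983/solr/admin/cores?action=STATUS&indexInfo=false
    }
}
```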




[jira] [Updated] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info

2013-01-14 Thread Shahar Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shahar Davidson updated SOLR-4302:
--

Attachment: SOLR-4302.patch

SOLR-4302, apply over trunk 1404975

 Improve CoreAdmin STATUS request response time by allowing user to omit the 
 Index info
 --

 Key: SOLR-4302
 URL: https://issues.apache.org/jira/browse/SOLR-4302
 Project: Solr
  Issue Type: Improvement
  Components: multicore
Affects Versions: 4.0, 4.1, 5.0
Reporter: Shahar Davidson
Priority: Minor
  Labels: performance
 Attachments: SOLR-4302.patch


 In large multicore environments (hundreds+ of cores), the STATUS request may 
 take a fair amount of time.
 It seems that the majority of time is spent retrieving the index related info.
 The suggested improvement allows the user to pass a parameter (indexInfo); 
 when it is 'false', index-related info (such as segmentCount, sizeInBytes, 
 numDocs, etc.) is not retrieved. By default indexInfo is 'true', to maintain 
 the existing STATUS request behavior.
 For example, when tested on a given machine with 380+ solr cores, the full 
 STATUS request took 800ms-900ms, whereas using indexInfo=false returned 
 results in about 1ms-4ms.




[jira] [Comment Edited] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info

2013-01-14 Thread Shahar Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552544#comment-13552544
 ] 

Shahar Davidson edited comment on SOLR-4302 at 1/14/13 10:00 AM:
-

Attached suggested patch SOLR-4302.patch. Apply over trunk 1404975.

  was (Author: shahar.davidson):
SOLR-4302, apply over trunk 1404975
  
 Improve CoreAdmin STATUS request response time by allowing user to omit the 
 Index info
 --

 Key: SOLR-4302
 URL: https://issues.apache.org/jira/browse/SOLR-4302
 Project: Solr
  Issue Type: Improvement
  Components: multicore
Affects Versions: 4.0, 4.1, 5.0
Reporter: Shahar Davidson
Priority: Minor
  Labels: performance
 Attachments: SOLR-4302.patch


 In large multicore environments (hundreds+ of cores), the STATUS request may 
 take a fair amount of time.
 It seems that the majority of time is spent retrieving the index related info.
 The suggested improvement allows the user to pass a parameter (indexInfo); 
 when it is 'false', index-related info (such as segmentCount, sizeInBytes, 
 numDocs, etc.) is not retrieved. By default indexInfo is 'true', to maintain 
 the existing STATUS request behavior.
 For example, when tested on a given machine with 380+ solr cores, the full 
 STATUS request took 800ms-900ms, whereas using indexInfo=false returned 
 results in about 1ms-4ms.




RE: looking for package org.apache.lucene.analysis.standard

2013-01-14 Thread JimAld
Thanks to everyone, I feel I'm getting somewhere, but not quite there yet. I
currently have the below in my pom. When I change my import to: 
import org.apache.lucene.queryparser.classic.QueryParser;
Eclipse says it can't find org.apache.lucene.queryparser; however, the
Maven installer has no such issue. 

The maven installer, does however have an issue with this line:
Analyzer analyzer = new StandardAnalyzer();
It says: 
cannot find symbol
symbol  : constructor StandardAnalyzer()
location: class org.apache.lucene.analysis.standard.StandardAnalyzer
Even though I have the import:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
Which Eclipse has no issue with. 

I've cleaned my project and restarted Eclipse with no improvement to the
differences shown by Eclipse and Maven. Any help much appreciated!

Pom dependencies:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.0.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.0.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>4.0.0</version>
    <scope>provided</scope>
</dependency>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033104.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552573#comment-13552573
 ] 

Michael McCandless commented on LUCENE-4620:


This change seemed to lose a bit of performance: look at 1/11/2013 on 
http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html

But, that tests just one dimension (Date), with only 3 ords per doc,
so I had assumed that this just wasn't enough ints being decoded to
see the gains from this bulk decoding.

So, I modified luceneutil to have more facets per doc (avg ~25 ords
per doc across 9 dimensions; 2.5M unique ords), and the results are
still slower:

{noformat}
                Task    QPS base      StdDev    QPS comp      StdDev    Pct diff
            HighTerm        3.62      (2.5%)        3.24      (1.0%)      -10.5% ( -13% -  -7%)
             MedTerm        7.34      (1.7%)        6.78      (0.9%)       -7.6% ( -10% -  -5%)
             LowTerm       14.92      (1.6%)       14.32      (1.2%)       -4.0% (  -6% -  -1%)
            PKLookup      181.47      (4.7%)      183.04      (5.3%)        0.9% (  -8% -  11%)
{noformat}

This is baffling ... not sure what's up.  I would expect some gains
given that the micro-benchmark showed sizable decode improvements.  It
must somehow be that decode cost is a minor part of facet counting?
(which is not a good sign!: it should be a big part of it...)


 Explore IntEncoder/Decoder bulk API
 ---

 Key: LUCENE-4620
 URL: https://issues.apache.org/jira/browse/LUCENE-4620
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch


 Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
 and decode(int). Originally, we believed that this layer can be useful for 
 other scenarios, but in practice it's used only for writing/reading the 
 category ordinals from payload/DV.
 Therefore, Mike and I would like to explore a bulk API, something like 
 encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
 can still be streaming (as we don't know in advance how many ints will be 
 written), dunno. Will figure this out as we go.
 One thing to check is whether the bulk API can work w/ e.g. facet 
 associations, which can write arbitrary byte[], and so decoding to an 
 IntsRef may not make sense. This too we'll figure out as we go. I don't rule 
 out that associations will use a different bulk API.
 At the end of the day, the requirement is for someone to be able to configure 
 how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
 etc.) and later read, with as little overhead as possible.
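
As a rough sketch of what such a bulk decode could look like (plain byte[]/int[] instead of Lucene's BytesRef/IntsRef, VInt coding only; the class and method names are invented and this is not the attached patch):

```java
import java.util.Arrays;

// Sketch of a bulk VInt codec: one call turns an encoded byte[] into an
// int[], instead of a streaming decode(int) call per value.
public class BulkVIntSketch {
    // VInt: 7 data bits per byte, high bit set means "more bytes follow".
    static byte[] encode(int[] values) {
        byte[] buf = new byte[values.length * 5]; // worst case 5 bytes/int
        int upto = 0;
        for (int v : values) {
            while ((v & ~0x7F) != 0) {
                buf[upto++] = (byte) ((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            buf[upto++] = (byte) v;
        }
        return Arrays.copyOf(buf, upto);
    }

    // Bulk decode: every value occupies at least one byte, so buf.length
    // bounds the value count and the output can be sized up front, with
    // no per-value capacity check inside the loop.
    static int[] decode(byte[] buf) {
        int[] out = new int[buf.length];
        int numValues = 0, value = 0, shift = 0;
        for (byte b : buf) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {      // high bit clear: last byte of value
                out[numValues++] = value;
                value = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return Arrays.copyOf(out, numValues);
    }

    public static void main(String[] args) {
        int[] ords = {3, 200, 70000};
        System.out.println(Arrays.toString(decode(encode(ords)))); // [3, 200, 70000]
    }
}
```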




[jira] [Assigned] (LUCENE-4676) IndexReader.isCurrent race

2013-01-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-4676:
---

Assignee: Simon Willnauer

 IndexReader.isCurrent race
 --

 Key: LUCENE-4676
 URL: https://issues.apache.org/jira/browse/LUCENE-4676
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Simon Willnauer
 Fix For: 4.1


 Revision: 1431169
 ant test  -Dtestcase=TestNRTManager 
 -Dtests.method=testThreadStarvationNoDeleteNRTReader 
 -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA 
 -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII 
 -Dtests.dups=500




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552594#comment-13552594
 ] 

Shai Erera commented on LUCENE-4620:


I'm baffled too. There is some overhead with the bulk API, in that it needs to 
{{grow()}} the {{IntsBuffer}} (something it didn't need to do before). But I 
believe that this growing should stabilize after few docs (i.e. the array 
becomes large enough). Still, every iteration checks if the array is large 
enough, so perhaps if we grow the IntsRef upfront (even if too much), we can 
remove the 'ifs'.

SimpleIntDecoder can do this easily: it knows there are 4 bytes per value, so it 
should just grow by buf.length / 4. VInt is trickier, but to be on the safe 
side it can grow by buf.length, since each value occupies at least one 
byte. Some other decoders are trickier still, but they don't come into play in 
your test above.

But I must admit I thought it a no-brainer that replacing an iterator 
API with a bulk one would improve performance. And indeed, {{EncodingSpeed}} 
already shows nice improvements. And even if decoding values is not the major 
part of faceted search (which I doubt), we shouldn't see slowdowns; at most we 
just shouldn't see big wins?

 Explore IntEncoder/Decoder bulk API
 ---

 Key: LUCENE-4620
 URL: https://issues.apache.org/jira/browse/LUCENE-4620
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch


 Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
 and decode(int). Originally, we believed that this layer can be useful for 
 other scenarios, but in practice it's used only for writing/reading the 
 category ordinals from payload/DV.
 Therefore, Mike and I would like to explore a bulk API, something like 
 encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
 can still be streaming (as we don't know in advance how many ints will be 
 written), dunno. Will figure this out as we go.
 One thing to check is whether the bulk API can work w/ e.g. facet 
 associations, which can write arbitrary byte[], and so decoding to an 
 IntsRef may not make sense. This too we'll figure out as we go. I don't rule 
 out that associations will use a different bulk API.
 At the end of the day, the requirement is for someone to be able to configure 
 how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
 etc.) and later read, with as little overhead as possible.




[jira] [Comment Edited] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552594#comment-13552594
 ] 

Shai Erera edited comment on LUCENE-4620 at 1/14/13 11:51 AM:
--

I'm baffled too. There is some overhead with the bulk API, in that it needs to 
{{grow()}} the {{IntsRef}} (something it didn't need to do before). But I 
believe that this growing should stabilize after few docs (i.e. the array 
becomes large enough). Still, every iteration checks if the array is large 
enough, so perhaps if we grow the IntsRef upfront (even if too much), we can 
remove the 'ifs'.

SimpleIntDecoder can do this easily: it knows there are 4 bytes per value, so it 
should just grow by buf.length / 4. VInt is trickier, but to be on the safe 
side it can grow by buf.length, since each value occupies at least one 
byte. Some other decoders are trickier still, but they don't come into play in 
your test above.

But I must admit I thought it a no-brainer that replacing an iterator 
API with a bulk one would improve performance. And indeed, {{EncodingSpeed}} 
already shows nice improvements. And even if decoding values is not the major 
part of faceted search (which I doubt), we shouldn't see slowdowns; at most we 
just shouldn't see big wins?

  was (Author: shaie):
I'm baffled too. There is some overhead with the bulk API, in that it needs 
to {{grow()}} the {{IntsBuffer}} (something it didn't need to do before). But I 
believe that this growing should stabilize after few docs (i.e. the array 
becomes large enough). Still, every iteration checks if the array is large 
enough, so perhaps if we grow the IntsRef upfront (even if too much), we can 
remove the 'ifs'.

SimpleIntDecoder can do it easily, it knows there are 4 bytes per value, so it 
should just grow by buf.length / 4. VInt is more tricky, but to be on the safe 
side it can grow by buf.length, as at the minimum each value occupies only one 
byte. Some other decoders are trickier, but they are not in effect in your test 
above.

But I must admit that I thought it's a no brainer that replacing an iterator 
API by a bulk is going to improve performance. And indeed, {{EncodingSpeed}} 
shows nice improvements already. And even if decoding values is not the major 
part of faceted search (which I doubt), we shouldn't see slowdowns? At the most 
we shouldn't see big wins?
  
 Explore IntEncoder/Decoder bulk API
 ---

 Key: LUCENE-4620
 URL: https://issues.apache.org/jira/browse/LUCENE-4620
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch


 Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
 and decode(int). Originally, we believed that this layer can be useful for 
 other scenarios, but in practice it's used only for writing/reading the 
 category ordinals from payload/DV.
 Therefore, Mike and I would like to explore a bulk API, something like 
 encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
 can still be streaming (as we don't know in advance how many ints will be 
 written), dunno. Will figure this out as we go.
 One thing to check is whether the bulk API can work w/ e.g. facet 
 associations, which can write arbitrary byte[], and so decoding to an 
 IntsRef may not make sense. This too we'll figure out as we go. I don't rule 
 out that associations will use a different bulk API.
 At the end of the day, the requirement is for someone to be able to configure 
 how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
 etc.) and later read, with as little overhead as possible.




[jira] [Commented] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552599#comment-13552599
 ] 

Commit Tag Bot commented on LUCENE-4683:


[trunk commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1432890

LUCENE-4683: Change Aggregator and CategoryListIterator to be per-segment


 Change Aggregator and CategoryListIterator to be per-segment
 

 Key: LUCENE-4683
 URL: https://issues.apache.org/jira/browse/LUCENE-4683
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4683.patch


 As another improvement, these two (mostly CategoryListIterator) should be 
 per-segment. I've got a patch nearly ready, will post tomorrow.




[jira] [Resolved] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment

2013-01-14 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-4683.


   Resolution: Fixed
Fix Version/s: 5.0
   4.1

I ran the tests a few times and all was quiet. Committed to trunk and 4x (added 
CHANGES entries too).

 Change Aggregator and CategoryListIterator to be per-segment
 

 Key: LUCENE-4683
 URL: https://issues.apache.org/jira/browse/LUCENE-4683
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4683.patch


 As another improvement, these two (mostly CategoryListIterator) should be 
 per-segment. I've got a patch nearly ready, will post tomorrow.




[jira] [Commented] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552601#comment-13552601
 ] 

Commit Tag Bot commented on LUCENE-4683:


[branch_4x commit] Shai Erera
http://svn.apache.org/viewvc?view=revision&revision=1432894

LUCENE-4683: Change Aggregator and CategoryListIterator to be per-segment


 Change Aggregator and CategoryListIterator to be per-segment
 

 Key: LUCENE-4683
 URL: https://issues.apache.org/jira/browse/LUCENE-4683
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4683.patch


 As another improvement, these two (mostly CategoryListIterator) should be 
 per-segment. I've got a patch nearly ready, will post tomorrow.




[jira] [Updated] (LUCENE-4321) java.io.FilterReader considered harmful

2013-01-14 Thread Artem Lukanin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Lukanin updated LUCENE-4321:
--

Attachment: NoRandomReadMockTokenizer.java

I had to extend MockTokenizer because I read the buffer completely to decide 
what to do with the input (whether or not to convert it to something else).

When you use different reading methods randomly, my tests don't pass. If the 
same method (whichever one) were used for the complete input string, they 
would pass; as it is, the output string is messed up, because some parts of the 
input are converted and some are not.

 java.io.FilterReader considered harmful
 ---

 Key: LUCENE-4321
 URL: https://issues.apache.org/jira/browse/LUCENE-4321
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.0-BETA
Reporter: Robert Muir
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, 
 LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, 
 NoRandomReadMockTokenizer.java


 See Dawid's email: http://find.searchhub.org/document/64b0a28c53faf39
 Reader.java is fine, it has lots of methods like read(), read(char[]), 
 read(CharBuffer), skip(), but these all have default implementations 
 delegating to read(char[], int, int).
 Unfortunately FilterReader delegates too many unnecessary things such as 
 read() and skip() in a broken way. It should have just left these alone.
 This can cause traps for someone upgrading because they have to override 
 multiple methods, when read(char[], int, int) should be enough, and all 
 Reader methods will then work correctly.
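
The pattern the issue recommends can be sketched with a small filter that overrides only read(char[], int, int); Reader's default read(), skip() and read(CharBuffer) then all funnel through it (the uppercasing filter is an invented example, not the patch):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Extending Reader directly and overriding only read(char[], int, int):
// the single-char read(), skip() and read(CharBuffer) defaults in Reader
// all delegate to this one method, so they stay consistent with the filter.
public class UpperCaseReader extends Reader {
    private final Reader in;

    public UpperCaseReader(Reader in) {
        this.in = in;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = in.read(cbuf, off, len); // n == -1 at EOF; loop then no-ops
        for (int i = off; i < off + n; i++) {
            cbuf[i] = Character.toUpperCase(cbuf[i]);
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        Reader r = new UpperCaseReader(new StringReader("hello"));
        System.out.println((char) r.read()); // 'H' via Reader's default read()
        r.skip(1);                           // skips 'e' via default skip()
        System.out.println((char) r.read()); // 'L'
    }
}
```

FilterReader, by contrast, forwards read() and skip() straight to the wrapped reader, so a subclass that overrides only read(char[], int, int) silently bypasses its own filtering on those paths.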




[jira] [Assigned] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info

2013-01-14 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar reassigned SOLR-4302:
---

Assignee: Shalin Shekhar Mangar

 Improve CoreAdmin STATUS request response time by allowing user to omit the 
 Index info
 --

 Key: SOLR-4302
 URL: https://issues.apache.org/jira/browse/SOLR-4302
 Project: Solr
  Issue Type: Improvement
  Components: multicore
Affects Versions: 4.0, 4.1, 5.0
Reporter: Shahar Davidson
Assignee: Shalin Shekhar Mangar
Priority: Minor
  Labels: performance
 Attachments: SOLR-4302.patch


 In large multicore environments (hundreds+ of cores), the STATUS request may 
 take a fair amount of time.
 It seems that the majority of time is spent retrieving the index related info.
 The suggested improvement allows the user to pass a parameter (indexInfo); 
 when it is 'false', index-related info (such as segmentCount, sizeInBytes, 
 numDocs, etc.) is not retrieved. By default indexInfo is 'true', to maintain 
 the existing STATUS request behavior.
 For example, when tested on a given machine with 380+ solr cores, the full 
 STATUS request took 800ms-900ms, whereas using indexInfo=false returned 
 results in about 1ms-4ms.




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552612#comment-13552612
 ] 

Shai Erera commented on LUCENE-4620:


I made this change to VInt8IntDecoder instead of checking inside the loop:

{code}
int numValues = buf.length; // a value occupies at least 1 byte
if (values.ints.length < numValues) {
  values.grow(numValues);
}
{code}

Ran EncodingSpeed again and compared the results. On average (4 datasets), 
VInt8 achieves a 0.69% speedup, DGap(VInt) 7.85%, and 
Sorting(Unique(DGap(VInt))) 10.16%. The last one is the default Encoder, 
though its decoder is only DGap(VInt), so I'm not sure why its result differs 
from the earlier 7.85% run.

However, it does look like it speeds things up...
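As a rough illustration of the bulk-decode idea with the up-front grow (this is not Lucene's actual encoding API; the class and method names are made up, and the 7-bit-groups-most-significant-first layout mirrors the VInt8 scheme discussed here):

```java
// Sketch of a bulk VInt8-style decoder: since every encoded value occupies
// at least one byte, the output array can be sized once up front (buf length
// is an upper bound on the number of values), avoiding a grow-check per value.
public class BulkVIntDecode {

  /** Decodes VInt8 values from buf[off..off+len) into out; returns the count. */
  static int decode(byte[] buf, int off, int len, int[] out) {
    int upto = 0;
    final int end = off + len;
    while (off < end) {
      int value = 0;
      int b;
      do {
        b = buf[off++] & 0xFF;
        value = (value << 7) | (b & 0x7F); // 7 payload bits per byte
      } while ((b & 0x80) != 0);           // high bit set = more bytes follow
      out[upto++] = value;
    }
    return upto;
  }

  public static void main(String[] args) {
    // 300 = 0b10_0101100 -> two bytes: 0x82 (continuation) then 0x2C; 5 -> 0x05.
    byte[] buf = { (byte) 0x82, 0x2C, 0x05 };
    int[] out = new int[buf.length]; // a value occupies at least 1 byte
    int n = decode(buf, 0, buf.length, out);
    System.out.println(n + " values: " + out[0] + ", " + out[1]);
  }
}
```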

 Explore IntEncoder/Decoder bulk API
 ---

 Key: LUCENE-4620
 URL: https://issues.apache.org/jira/browse/LUCENE-4620
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch


 Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
 and decode(int). Originally, we believed that this layer can be useful for 
 other scenarios, but in practice it's used only for writing/reading the 
 category ordinals from payload/DV.
 Therefore, Mike and I would like to explore a bulk API, something like 
 encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
 can still be streaming (as we don't know in advance how many ints will be 
 written), dunno. Will figure this out as we go.
 One thing to check is whether the bulk API can work w/ e.g. facet 
 associations, which can write arbitrary byte[], and so may decoding to an 
 IntsRef won't make sense. This too we'll figure out as we go. I don't rule 
 out that associations will use a different bulk API.
 At the end of the day, the requirement is for someone to be able to configure 
 how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
 etc.) and later read, with as little overhead as possible.




[jira] [Commented] (LUCENE-4321) java.io.FilterReader considered harmful

2013-01-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552613#comment-13552613
 ] 

Robert Muir commented on LUCENE-4321:
-

Your charfilter is broken.

 java.io.FilterReader considered harmful
 ---

 Key: LUCENE-4321
 URL: https://issues.apache.org/jira/browse/LUCENE-4321
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.0-BETA
Reporter: Robert Muir
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, 
 LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, 
 NoRandomReadMockTokenizer.java


 See Dawid's email: http://find.searchhub.org/document/64b0a28c53faf39
 Reader.java is fine, it has lots of methods like read(), read(char[]), 
 read(CharBuffer), skip(), but these all have default implementations 
 delegating to read(char[], int, int).
 Unfortunately FilterReader delegates too many unnecessary things such as 
 read() and skip() in a broken way. It should have just left these alone.
 This can cause traps for someone upgrading because they have to override 
 multiple methods, when read(char[], int, int) should be enough, and all 
 Reader methods will then work correctly.




[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552614#comment-13552614
 ] 

Michael McCandless commented on LUCENE-4682:


I tried removing NEXT opto in building the all-English-Wikipedia-terms FST and 
it was a big hit:

  * With NEXT: 59267794 bytes

  * Without NEXT: 82543993 bytes

So FST would be ~39% larger if we remove NEXT ... however lookup sped up from 
726 ns per lookup to 636 ns.  But: we could get this speedup today, if we just 
fixed setting of a NEXT arc's target to be lazy instead.  Today it's very 
costly for non-array arcs because we scan to the end of all nodes to set the 
target, even if the caller isn't going to use it, which is really ridiculous.

I also tested delta-coding the arc target instead of the abs vInt we have 
today ... it wasn't a real test, instead I just measured how many bytes the 
vInt delta would be vs how many bytes the vInt abs it today, and the results 
were disappointing:

  * Abs vInt (what we do today): 26681349 bytes

  * Delta vInt: 25479316 bytes

Which is surprising ... I guess we don't see much locality for the nodes ... 
or, e.g., the common suffixes freeze early on and then lots of future nodes 
refer to them.
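The byte accounting behind that comparison can be sketched as follows (the node addresses and write position below are invented; this is only the vInt size arithmetic, not FST code):

```java
// Illustrative sketch: how many bytes a vInt takes when an arc's target node
// is written as an absolute address versus as a delta from the current write
// position. With poor locality, deltas are nearly as large as absolutes,
// which matches the disappointing measurement above.
public class VIntSizing {

  /** Number of bytes a non-negative value occupies as a vInt (7 bits/byte). */
  static int vIntSize(long v) {
    int bytes = 1;
    while ((v >>>= 7) != 0) {
      bytes++;
    }
    return bytes;
  }

  public static void main(String[] args) {
    long[] targets = { 10, 5_000, 4_900, 1_000_000, 999_950 }; // hypothetical node addresses
    long writePos = 1_000_100;                                 // hypothetical current position
    long absBytes = 0, deltaBytes = 0;
    for (long t : targets) {
      absBytes += vIntSize(t);            // absolute target (what FST does today)
      deltaBytes += vIntSize(writePos - t); // delta back to an already-written node
    }
    System.out.println("abs=" + absBytes + " delta=" + deltaBytes);
  }
}
```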

Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (pack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...
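The waste described in the issue can be sketched with a few lines of arithmetic (the per-arc byte counts below are made up; only the bookkeeping is the point):

```java
// Sketch of the array-arc waste: when a node's arcs are written as a
// fixed-stride array, every arc is padded to max(numBytesPerArc), so a single
// outlier arc (e.g. one with a big output) inflates all the others.
public class ArcArrayWaste {

  /** Bytes wasted if each arc is padded to the widest arc's size. */
  static int wastedBytes(int[] bytesPerArc) {
    int max = 0, total = 0;
    for (int b : bytesPerArc) {
      max = Math.max(max, b);
      total += b;
    }
    return bytesPerArc.length * max - total;
  }

  public static void main(String[] args) {
    // One outlier arc of 9 bytes forces every arc slot to 9 bytes:
    int[] bytesPerArc = { 2, 2, 3, 2, 9 };
    System.out.println(wastedBytes(bytesPerArc) + " bytes wasted"); // 5*9 - 18 = 27
  }
}
```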




[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552619#comment-13552619
 ] 

Robert Muir commented on LUCENE-4682:
-

{quote}
So FST would be ~39% larger if we remove NEXT
{quote}

But according to your notes above, we have 28% waste for this (with a long 
output).
Are we making the right tradeoff?

{quote}
Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?
{quote}

Or, not do it at all if we can't figure it out? The reversing holds back other 
improvements, so benchmarking it by itself could be misleading.





[jira] [Commented] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552621#comment-13552621
 ] 

Commit Tag Bot commented on SOLR-4302:
--

[trunk commit] Shalin Shekhar Mangar
http://svn.apache.org/viewvc?view=revision&revision=1432901

SOLR-4302: New parameter 'indexInfo' (defaults to true) in CoreAdmin STATUS 
command can be used to omit index specific information





[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552624#comment-13552624
 ] 

Dawid Weiss commented on LUCENE-4682:
-

bq. I also tested delta-coding the arc target instead of the abs vInt we have 
today ...

I did such experiments when I was working on that paper. Remember -- you don't 
publish negative results, unfortunately. Obviously I didn't try everything but: 
1) NEXT was important, 2) the structure of the FST doesn't yield to easy local 
deltas; it's not easily separable and pointers typically jump all over.

bq. Which is surprising ... I guess we don't see much locality for the nodes 
... or, eg the common suffixes freeze early on and then lots of future nodes 
refer to them.

Not really that surprising -- you encode common suffixes after all so most of 
them will appear in a properly sized sample. This is actually why the strategy 
of moving nodes around works too -- you move those that are super frequent but 
they'll most likely be reordered at the top suffix frequencies of the 
automaton anyway.





[jira] [Resolved] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info

2013-01-14 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-4302.
-

   Resolution: Fixed
Fix Version/s: 5.0
   4.1

Committed to trunk and branch_4x.

Thanks Shahar!




[jira] [Commented] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552628#comment-13552628
 ] 

Commit Tag Bot commented on SOLR-4302:
--

[branch_4x commit] Shalin Shekhar Mangar
http://svn.apache.org/viewvc?view=revision&revision=1432903

SOLR-4302: New parameter 'indexInfo' (defaults to true) in CoreAdmin STATUS 
command can be used to omit index specific information





[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552633#comment-13552633
 ] 

Michael McCandless commented on LUCENE-4682:


{quote}
bq. So FST would be ~39% larger if we remove NEXT

But according to your notes above, we have 28% waste for this (with a long 
output).
Are we making the right tradeoff?
{quote}

Wait: the 28% waste comes from the array arcs (unrelated to NEXT?).  To fix 
that I think we should use a skip list?  Surely the bytes required to encode 
the skip list are less than our waste today.

{quote}
bq. Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?

Or, not do it at all if we cant figure it out? The reversing holds back other 
improvements so
benchmarking it by itself could be misleading.
{quote}

I don't think we should drop NEXT unless we have some alternative?  A 39% 
increase in size is non-trivial!

I know reversing held back delta-code of the node target, but, that looks like 
it won't gain much.  What else is it holding back?




[jira] [Commented] (LUCENE-4570) release policeman tools?

2013-01-14 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552635#comment-13552635
 ] 

Uwe Schindler commented on LUCENE-4570:
---

I started a google code project: http://code.google.com/p/forbidden-apis/

This is a fork with many new additions:
- auto-generated deprecated signature list (from rt.jar)
- support for bundled and project-maintained signature lists (like the 
deprecated ones for various JDK versions, the well known charset/locale/... 
violators)
- no direct ASM 4.1 dependency conflicting with other dependencies: The ASM 
library is jarjar'ed into the artifact
- _not yet:_ Comments for every signature that's printed in the error message
- _not yet:_ Mäven support (Mojo)

Once there is a release (hopefully soon)

 release policeman tools?
 

 Key: LUCENE-4570
 URL: https://issues.apache.org/jira/browse/LUCENE-4570
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Robert Muir

 Currently there is source code in lucene/tools/src (e.g. Forbidden APIs 
 checker ant task).
 It would be convenient if you could download this thing in your ant build 
 from ivy (especially if maybe it included our definitions .txt files as 
 resources).
 In general checking for locale/charset violations in this way is a pretty 
 general useful thing for a server-side app.
 Can we either release lucene-tools.jar as an artifact, or maybe alternatively 
 move this somewhere else as a standalone project and suck it in ourselves?




[jira] [Comment Edited] (LUCENE-4570) release policeman tools?

2013-01-14 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552635#comment-13552635
 ] 

Uwe Schindler edited comment on LUCENE-4570 at 1/14/13 1:05 PM:


I started a google code project: http://code.google.com/p/forbidden-apis/

This is a fork with many new additions:
- auto-generated deprecated signature list (from rt.jar)
- support for bundled and project-maintained signature lists (like the 
deprecated ones for various JDK versions, the well known charset/locale/... 
violators)
- no direct ASM 4.1 dependency conflicting with other dependencies: The ASM 
library is jarjar'ed into the artifact
- _not yet:_ Comments for every signature thats printed in error message
- _not yet:_ Mäven support (Mojo) - Selckin already started a fork in Github, 
but as the new project is almost a complete rewrite of the API (decouple ANT 
task from logic), I will need his help
- _not yet:_ Mäven Release, so IVY can download it

Once there is a release (hopefully soon), this can be ivy:cachepath'ed and 
taskdef'ed into the Lucene build




[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552641#comment-13552641
 ] 

Robert Muir commented on LUCENE-4682:
-

{quote}
Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that 
I think we should use a skip list? Surely the bytes required to encode the skip 
list are less than our waste today.
{quote}

{quote}
I know reversing held back delta-code of the node target, but, that looks like 
it won't gain much. What else is it holding back?
{quote}

I mean in general NEXT/reversing adds more complexity here which makes it 
harder to try different things? Like a big doberman and BEWARE sign scaring off 
developers :)

It's a big part of what sent me over the edge trying to refactor FST to have a 
store abstraction (LUCENE-4593). But fortunately you did that anyway...

It would be really really really good for FSTs long term to do things like 
remove reversing, remove packed (fold these optos or at least most of them in 
by default), etc.





[jira] [Commented] (LUCENE-4570) release policeman tools?

2013-01-14 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552644#comment-13552644
 ] 

Dawid Weiss commented on LUCENE-4570:
-

Nice!




RE: looking for package org.apache.lucene.analysis.standard

2013-01-14 Thread Steve Rowe
Hi Jim,

Try getting rid of the <scope>provided</scope> lines.

Steve
On Jan 14, 2013 5:38 AM, JimAld jim.alder...@db.com wrote:

 Thanks to everyone, I feel I'm getting somewhere, but not quite there yet.
 I
 currently have the below in my pom. When I change my import to:
 import org.apache.lucene.queryparser.classic.QueryParser;
 Eclipse says it can't find org.apache.lucene.queryparser; however, the
 Maven installer has no such issue.

 The Maven installer does, however, have an issue with this line:
 Analyzer analyzer = new StandardAnalyzer();
 It says:
 cannot find symbol
 symbol  : constructor StandardAnalyzer()
 location: class org.apache.lucene.analysis.standard.StandardAnalyzer
 Even though I have the import:
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 Which Eclipse has no issue with.

 I've cleaned my project and restarted Eclipse with no improvement to the
 differences shown by Eclipse and Maven. Any help much appreciated!

 Pom dependencies:
 <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-core</artifactId>
   <version>4.0.0</version>
   <scope>provided</scope>
 </dependency>
 <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-analyzers-common</artifactId>
   <version>4.0.0</version>
   <scope>provided</scope>
 </dependency>
 <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-queryparser</artifactId>
   <version>4.0.0</version>
   <scope>provided</scope>
 </dependency>



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033104.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org

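A likely cause of the "cannot find symbol: constructor StandardAnalyzer()" error above: if I recall the 4.0 API correctly, StandardAnalyzer has no no-argument constructor in Lucene 4.0 and requires a Version constant. A sketch (needs the lucene-core and lucene-analyzers-common 4.0.0 jars on the classpath):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerExample {
  public static void main(String[] args) {
    // Lucene 4.0 analyzers take a Version argument so behavior can be kept
    // backwards-compatible across releases; there is no no-arg constructor.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    System.out.println(analyzer.getClass().getSimpleName());
  }
}
```

Eclipse compiling it while Maven fails usually means the two are resolving different Lucene versions (e.g. a stale 3.x jar on the Eclipse build path, where StandardAnalyzer() without arguments still existed via deprecated paths).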



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552653#comment-13552653
 ] 

Michael McCandless commented on LUCENE-4682:


bq. I mean in general NEXT/reversing adds more complexity here which makes it 
harder to try different things? Like a big doberman and BEWARE sign scaring off 
developers 

LOL :)

But yeah I agree.

bq. Its a big part of what sent me over the edge trying to refactor FST to have 
a store abstraction (LUCENE-4593). But fortunately you did that anyway...

Right but it's not good if bus factor is 1 ... it's effectively dead code when 
that happens.

bq. It would be really really really good for FSTs long term to do things like 
remove reversing, remove packed (fold these optos or at least most of them in 
by default), etc.

+1, except that NEXT buys us a much smaller FST now.  We can't just drop it ... 
we need some way to simplify it (eg somehow stop reversing).

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (pack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...
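To make the waste concrete, here is a small self-contained sketch of the arithmetic the issue describes: fixed-width array encoding pads every arc slot to the largest arc's size. The per-arc byte counts below are made up for illustration; `bytesPerArc` is not Lucene's actual variable.

```java
public class ArcWasteSketch {
  public static void main(String[] args) {
    // Hypothetical sizes: four small arcs plus one outlier with a big output.
    int[] bytesPerArc = { 2, 2, 2, 2, 9 };
    int max = 0, actual = 0;
    for (int n : bytesPerArc) {
      max = Math.max(max, n);
      actual += n;
    }
    // Array-arc encoding pads every slot to max(numBytesPerArc).
    int arrayEncoded = max * bytesPerArc.length;
    System.out.println("needed=" + actual + " array=" + arrayEncoded
        + " wasted=" + (arrayEncoded - actual));
  }
}
```

With one 9-byte outlier among 2-byte arcs, over half the array-encoded bytes are padding, which is the same effect measured at ~18% across the whole Kuromoji FST.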

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

Maybe doing bulk-vInt-decode (see patch) will be faster (just make hotspot's 
job easier) ... I'll test.
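For context, a self-contained sketch of what a bulk vInt encode/decode roundtrip might look like. Plain `int[]`/`byte[]` arrays stand in for Lucene's IntsRef/BytesRef, and the class and method names are illustrative, not the patch's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class BulkVIntSketch {
  // Bulk-encode all values into one byte[] using vInt: 7 payload bits per
  // byte, high bit set on every byte except the last.
  static byte[] encode(int[] values) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int v : values) {
      while ((v & ~0x7F) != 0) {
        out.write((v & 0x7F) | 0x80);
        v >>>= 7;
      }
      out.write(v);
    }
    return out.toByteArray();
  }

  // Bulk-decode the whole buffer back into ints in one call.
  static int[] decode(byte[] buf, int count) {
    int[] values = new int[count];
    int pos = 0;
    for (int i = 0; i < count; i++) {
      int value = 0, shift = 0;
      byte b;
      do {
        b = buf[pos++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      } while (b < 0); // signed byte < 0 means the continuation bit is set
      values[i] = value;
    }
    return values;
  }

  public static void main(String[] args) {
    int[] ords = { 3, 128, 100000 };
    int[] back = decode(encode(ords), ords.length);
    System.out.println(Arrays.equals(ords, back));
  }
}
```

The `do { ... } while (b < 0)` inner loop is the per-integer continuation check discussed later in this thread.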

 Explore IntEncoder/Decoder bulk API
 ---

 Key: LUCENE-4620
 URL: https://issues.apache.org/jira/browse/LUCENE-4620
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, 
 LUCENE-4620.patch


 Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
 and decode(int). Originally, we believed that this layer can be useful for 
 other scenarios, but in practice it's used only for writing/reading the 
 category ordinals from payload/DV.
 Therefore, Mike and I would like to explore a bulk API, something like 
 encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
 can still be streaming (as we don't know in advance how many ints will be 
 written), dunno. Will figure this out as we go.
 One thing to check is whether the bulk API can work w/ e.g. facet 
 associations, which can write arbitrary byte[], and so maybe decoding to an 
 IntsRef won't make sense. This too we'll figure out as we go. I don't rule 
 out that associations will use a different bulk API.
 At the end of the day, the requirement is for someone to be able to configure 
 how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
 etc.) and later read, with as little overhead as possible.




[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552656#comment-13552656
 ] 

Michael McCandless commented on LUCENE-3298:


bq. The impact will show on 32-bit systems I'm pretty sure of that.

Yeah I think it will too ...

bq. We don't care about hardware archaeology, do we?

I think Lucene should continue to run on 32 bit hardware, but I don't think 
performance on 32 bit is important, ie we should optimize for 64 bit 
performance.

 FST has hard limit max size of 2.1 GB
 -

 Key: LUCENE-3298
 URL: https://issues.apache.org/jira/browse/LUCENE-3298
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, 
 LUCENE-3298.patch


 The FST uses a single contiguous byte[] under the hood, which in java is 
 indexed by int so we cannot grow this over Integer.MAX_VALUE.  It also 
 internally encodes references to this array as vInt.
 We could switch this to a paged byte[] and make the max size far larger.
 But I think this is low priority... I'm not going to work on it any time soon.
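A minimal sketch of the paged byte[] idea, assuming pages addressed by shift/mask so total capacity can exceed Integer.MAX_VALUE (all names here are illustrative, not Lucene's actual implementation):

```java
public class PagedBytesSketch {
  static final int PAGE_BITS = 15;             // 32 KB pages
  static final int PAGE_SIZE = 1 << PAGE_BITS;
  static final int PAGE_MASK = PAGE_SIZE - 1;

  byte[][] pages = new byte[1][PAGE_SIZE];

  // Addresses are long: page index from the high bits, offset from the low.
  byte get(long addr) {
    return pages[(int) (addr >>> PAGE_BITS)][(int) (addr & PAGE_MASK)];
  }

  void set(long addr, byte b) {
    int page = (int) (addr >>> PAGE_BITS);
    while (page >= pages.length) {             // grow the page table lazily
      byte[][] next = new byte[pages.length * 2][];
      System.arraycopy(pages, 0, next, 0, pages.length);
      pages = next;
    }
    if (pages[page] == null) pages[page] = new byte[PAGE_SIZE];
    pages[page][(int) (addr & PAGE_MASK)] = b;
  }

  public static void main(String[] args) {
    PagedBytesSketch store = new PagedBytesSketch();
    store.set(70000L, (byte) 42);              // lands on the third page
    System.out.println(store.get(70000L));
  }
}
```

The cost is an extra indirection per byte access, plus care anywhere the code assumed a single contiguous array (e.g. the vInt-encoded references mentioned above).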




[jira] [Assigned] (LUCENE-4570) release policeman tools?

2013-01-14 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-4570:
-

Assignee: Uwe Schindler

 release policeman tools?
 

 Key: LUCENE-4570
 URL: https://issues.apache.org/jira/browse/LUCENE-4570
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Robert Muir
Assignee: Uwe Schindler

 Currently there is source code in lucene/tools/src (e.g. Forbidden APIs 
 checker ant task).
 It would be convenient if you could download this thing in your ant build 
 from ivy (especially if maybe it included our definitions .txt files as 
 resources).
 In general, checking for locale/charset violations in this way is a pretty 
 generally useful thing for a server-side app.
 Can we either release lucene-tools.jar as an artifact, or maybe alternatively 
 move this somewhere else as a standalone project and suck it in ourselves?




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552668#comment-13552668
 ] 

Shai Erera commented on LUCENE-4620:


I see. I have two comments about the patch. This part is wrong:

{code}
+int needed = upto - buf.offset;
+if (values.length < needed) {
+  values.grow(needed);
+}
{code}

should be

{code}
+if (values.ints.length < buf.length) {
+  values.grow(buf.length);
+}
{code}

Does it even run for you? because {{values.length = 0}} at start.

Also, note how this way you check offset < upto on every byte read while in the 
current code it's checked only once per integer read. Maybe if you do a while 
loop inside the loop, something like {{while (b < 0)}}.




[jira] [Comment Edited] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552668#comment-13552668
 ] 

Shai Erera edited comment on LUCENE-4620 at 1/14/13 2:10 PM:
-

I see. I have two comments about the patch. This part is wrong:

{code}
+int needed = upto - buf.offset;
+if (values.length < needed) {
+  values.grow(needed);
+}
{code}

should be

{code}
+if (values.ints.length < buf.length) {
+  values.grow(buf.length);
+}
{code}

With your patch, values.grow() is always called, even if internally it doesn't do 
anything.
I wonder if we should not {{grow()}} the array, but rather grow it from the 
outside ourselves, because IntsRef.grow() checks the capacity again (and Robert 
is against grow() anyway...).

Also, note how this way you check offset < upto on every byte read while in the 
current code it's checked only once per integer read. Maybe if you do a while 
loop inside the loop, something like {{while (b < 0)}}.

  was (Author: shaie):
I see. I have two comments about the patch. This part is wrong:

{code}
+int needed = upto - buf.offset;
+if (values.length < needed) {
+  values.grow(needed);
+}
{code}

should be

{code}
+if (values.ints.length < buf.length) {
+  values.grow(buf.length);
+}
{code}

Does it even run for you? because {{values.length = 0}} at start.

Also, note how this way you check offset < upto on every byte read while in the 
current code it's checked only once per integer read. Maybe if you do a while 
loop inside the loop, something like {{while (b < 0)}}.
  



[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race

2013-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552677#comment-13552677
 ] 

Simon Willnauer commented on LUCENE-4676:
-

After looking at IW infostreams for a while I am convinced this is a test bug 
(a pretty rare one, I'd say). So what happens here is the following 
(applyDeletes=false):

{noformat}
1. Thread[1] adds a doc (D1)
2. Thread[1] pull a new reader
3. Thread[1] adds another doc (D2)
3a. Thread[2] pull a new reader
3b. Thread[2] adds a del query
3c. Thread[2] pull a new reader
4. Thread[1] checks if reader is current 
{noformat}
(3a - 3c are concurrent.) Given that we don't apply deletes on an NRT reader 
pull, we should see _isCurrent == false_.

Well, this works most of the time, unless a concurrent merge is kicked off 
right after the doc was added in _3_ that sees both flushed segments (D1 and D2) 
and subsequently tries to apply deletes to those segments. Here comes the 
problem: if the applyDeletes is fast enough (i.e. reaches 
BufferedDeletesStream#prune()) before _4_, it drops the delete query from the 
stream (correct behavior!) but doesn't checkpoint, since no segment was 
affected. If we check isCurrent now, we see a _true_ value: the 
BufferedDeletesStream is empty (pruned) and the merge hasn't finished yet (no 
checkpoint), which means the version of the SegmentInfos is the same.

Does this make sense? I switched this test over to NoMergePolicy and it passes 
every time (500k executions), while with a real MergePolicy it fails very 
quickly for me.


 IndexReader.isCurrent race
 --

 Key: LUCENE-4676
 URL: https://issues.apache.org/jira/browse/LUCENE-4676
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Simon Willnauer
 Fix For: 4.1


 Revision: 1431169
 ant test  -Dtestcase=TestNRTManager 
 -Dtests.method=testThreadStarvationNoDeleteNRTReader 
 -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA 
 -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII 
 -Dtests.dups=500




[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552680#comment-13552680
 ] 

Martijn van Groningen commented on LUCENE-3931:
---

This makes sense to me.
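For illustration, the requested behavior can be sketched independently of Lucene: strip a leading article (including d) up through the apostrophe. This is a toy sketch, not ElisionFilter's actual code, and it only handles the plain ASCII apostrophe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ElisionSketch {
  // The default article set proposed in the issue, including "d".
  static final Set<String> ARTICLES =
      new HashSet<>(Arrays.asList("l", "m", "t", "qu", "n", "s", "j", "d"));

  // Strips "article + apostrophe" from the front of a token, the way an
  // elision filter would.
  static String elide(String token) {
    int apos = token.indexOf('\'');
    if (apos > 0 && ARTICLES.contains(token.substring(0, apos).toLowerCase())) {
      return token.substring(apos + 1);
    }
    return token;
  }

  public static void main(String[] args) {
    System.out.println(elide("d'accord"));  // leading "d'" stripped
    System.out.println(elide("l'avion"));   // leading "l'" stripped
    System.out.println(elide("monde"));     // no apostrophe: unchanged
  }
}
```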

 Adding d character to default ElisionFilter
 -

 Key: LUCENE-3931
 URL: https://issues.apache.org/jira/browse/LUCENE-3931
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: David Pilato
Priority: Trivial

 As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d 
 character is used in French as an elision character.
 E.g.: déclaration d'espèce
 So, it would be useful to have it as a default elision token.
 {code:title=ElisionFilter.java|borderStyle=solid}
   private static final CharArraySet DEFAULT_ARTICLES = 
 CharArraySet.unmodifiableSet(
   new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList(
   "l", "m", "t", "qu", "n", "s", "j", "d"), true));
 {code}
 HTH
 David.




[jira] [Assigned] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen reassigned LUCENE-3931:
-

Assignee: Martijn van Groningen




[jira] [Updated] (SOLR-4016) Deduplication is broken by partial update

2013-01-14 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-4016:


Attachment: SOLR-4016-disallow-partial-update.patch

Patch which disallows partial updates on signature generating fields

 Deduplication is broken by partial update
 -

 Key: SOLR-4016
 URL: https://issues.apache.org/jira/browse/SOLR-4016
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.0
 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS
Reporter: Joel Nothman
Assignee: Shalin Shekhar Mangar
  Labels: 4.0.1_Candidate
 Fix For: 4.1, 5.0

 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch


 The SignatureUpdateProcessorFactory used (primarily?) for deduplication does 
 not consider partial update semantics.
 The below uses the following solrconfig.xml excerpt:
 {noformat}
  <updateRequestProcessorChain name="text_hash">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">text_hash</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">text</str>
      <str name="signatureClass">solr.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
 {noformat}
 Firstly, the processor treats {noformat}{"set": value}{noformat} as a 
 string and hashes it, instead of the value alone:
 {noformat}
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d 
 '{"add":{"doc":{"id": "abcde", "text": {"set": "hello world"}}}}' && curl 
 '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":30}}
 <?xml version="1.0" encoding="UTF-8"?><response><lst 
 name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst 
 name="params"><str name="q">id:abcde</str></lst></lst><result name="response" 
 numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello 
 world</str><str name="text_hash">ad48c7ad60ac22cc</str><long 
 name="_version_">1417247434224959488</long></doc></result>
 </response>
 $
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d 
 '{"add":{"doc":{"id": "abcde", "text": "hello world"}}}' && curl 
 '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":27}}
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int 
 name="QTime">1</int><lst name="params"><str 
 name="q">id:abcde</str></lst></lst><result name="response" numFound="1" 
 start="0"><doc><str name="id">abcde</str><str name="text">hello 
 world</str><str name="text_hash">b169c743d220da8d</str><long 
 name="_version_">141724802221564</long></doc></result>
 </response>
 {noformat}
 Note the different text_hash value.
 Secondly, when updating a field other than those used to create the signature 
 (which I imagine is a more common use-case), the signature is recalculated 
 from no values:
 {noformat}
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d 
 '{"add":{"doc":{"id": "abcde", "title": {"set": "new title"}}}}' && curl 
 '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":39}}
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int 
 name="QTime">1</int><lst name="params"><str 
 name="q">id:abcde</str></lst></lst><result name="response" numFound="1" 
 start="0"><doc><str name="id">abcde</str><str name="text">hello 
 world</str><str name="text_hash"></str><str name="title">new 
 title</str><long name="_version_">1417248120480202752</long></doc></result>
 </response>
 {noformat}
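The first failure mode can be reproduced without Solr: if the signature processor hashes the raw value it received for the field, a partial-update payload hashes differently from the bare value even though the stored text ends up identical. In this sketch, MD5 stands in for TextProfileSignature (which it is not), and the partial-update string is an illustrative rendering of the atomic-update map:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SignatureSketch {
  // Hash whatever string the processor sees for the signature field.
  static String signature(String fieldValue) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(fieldValue.getBytes(StandardCharsets.UTF_8));
    return new BigInteger(1, digest).toString(16);
  }

  public static void main(String[] args) throws Exception {
    // A partial update delivers the atomic-update structure, not the bare
    // value, so the two signatures disagree.
    String bare = signature("hello world");
    String partial = signature("{set=hello world}");
    System.out.println(bare.equals(partial));
  }
}
```

This is why the attached patch simply disallows partial updates on signature-generating fields rather than trying to recompute the signature.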




[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552686#comment-13552686
 ] 

Tommaso Teofili commented on LUCENE-3931:
-

that's true for Italian as well.




[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552687#comment-13552687
 ] 

Steve Rowe commented on LUCENE-3931:


Because ElisionFilter is used by more than just French, the set of 
contractions was moved out of ElisionFilter (LUCENE-3884).

The issue of missing French contractions has already been addressed, in 
LUCENE-4662.

I didn't notice this issue - otherwise I would have resolved it when I resolved 
LUCENE-4662.

So Martijn, unless there is some other reason to keep this issue open, I think 
it can be resolved as a duplicate.




[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552692#comment-13552692
 ] 

Steve Rowe commented on LUCENE-3931:


bq. that's true for Italian as well.

[ItalianAnalyzer|http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_0_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/it/ItalianAnalyzer.java?revision=1396952view=markup#l53]
 includes d in the list of contractions it gives to ElisionFilter.





[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552729#comment-13552729
 ] 

Tommaso Teofili commented on LUCENE-3931:
-

ok, thanks for clarifying Steve.




[jira] [Closed] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen closed LUCENE-3931.
-

Resolution: Fixed
  Assignee: (was: Martijn van Groningen)




[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread Martijn van Groningen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552761#comment-13552761
 ] 

Martijn van Groningen commented on LUCENE-3931:
---

I see. I'll close it.




[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter

2013-01-14 Thread David Pilato (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552764#comment-13552764
 ] 

David Pilato commented on LUCENE-3931:
--

Thanks all!

 Adding d character to default ElisionFilter
 -

 Key: LUCENE-3931
 URL: https://issues.apache.org/jira/browse/LUCENE-3931
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: David Pilato
Priority: Trivial

 As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d 
 character is used in French as an elision character.
 E.g.: déclaration d'espèce
 So, it would be useful to have it as a default elision token.
 {code:title=ElisionFilter.java|borderStyle=solid}
   private static final CharArraySet DEFAULT_ARTICLES = 
 CharArraySet.unmodifiableSet(
   new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList(
   "l", "m", "t", "qu", "n", "s", "j", "d"), true));
 {code}
 HTH
 David.




[jira] [Updated] (LUCENE-4602) Use DocValues to store per-doc facet ord

2013-01-14 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-4602:
---

Component/s: modules/facet

 Use DocValues to store per-doc facet ord
 

 Key: LUCENE-4602
 URL: https://issues.apache.org/jira/browse/LUCENE-4602
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Michael McCandless
 Attachments: LUCENE-4602.patch, LUCENE-4602.patch


 Spinoff from LUCENE-4600
 DocValues can be used to hold the byte[] encoding all facet ords for
 the document, instead of payloads.  I made a hacked up approximation
 of in-RAM DV (see CachedCountingFacetsCollector in the patch) and the
 gains were somewhat surprisingly large:
 {noformat}
     Task    QPS base  StdDev    QPS comp  StdDev        Pct diff
 HighTerm        0.53  (0.9%)        1.00  (2.5%)   87.3% (  83% -   91%)
  LowTerm        7.59  (0.6%)       26.75 (12.9%)  252.6% ( 237% -  267%)
  MedTerm        3.35  (0.7%)       12.71  (9.0%)  279.8% ( 268% -  291%)
 {noformat}
 I didn't think payloads were THAT slow; I think it must be the advance
 implementation?
 We need to separately test on-disk DV to make sure it's at least
 on-par with payloads (but hopefully faster) and if so ... we should
 cutover facets to using DV.
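The per-document byte[] mentioned above is typically built by delta-coding the sorted ordinals and writing each delta as a vInt (the same variable-byte scheme Lucene's DataOutput.writeVInt uses). A minimal sketch with hypothetical names, not the actual facets code:

```java
import java.io.ByteArrayOutputStream;

public class FacetOrdEncoder {
    // Encode sorted facet ordinals as delta-coded vInts: the compact byte[]
    // a DocValues field (or payload) would hold for one document.
    public static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int delta = ord - prev;
            prev = ord;
            // vInt: 7 data bits per byte, high bit set on all but the last byte
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // deltas 3, 7, 190 -> 1 + 1 + 2 bytes
        System.out.println(encode(new int[]{3, 10, 200}).length); // 4
    }
}
```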




4.1 branch

2013-01-14 Thread Steve Rowe
For anyone with pending patches: I plan on branching for 4.1 at around 1:00pm 
US EST.



RE: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present

2013-01-14 Thread Dyer, James
Jack,

Did you test this to see if you could trigger this bug?  But in any case, can 
you open a jira ticket so this won't fall under the radar?  Even if the comment 
that was put here is true I guess we should minimally throw an exception, or 
use the first one and log a warning, maybe?

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Sunday, January 13, 2013 1:24 PM
To: Lucene/Solr Dev
Subject: Possible bug in Solr SpellCheckComponent if more than one 
QueryConverter class is present

Reading through the code of Solr's SpellCheckComponent.java for 4.1, it looks 
like it neither complains nor defaults reasonably if more than one 
QueryConverter class is present in the Solr lib directories:

Map<String, QueryConverter> queryConverters = new HashMap<String, 
QueryConverter>();
core.initPlugins(queryConverters, QueryConverter.class);

//ensure that there is at least one query converter defined
if (queryConverters.size() == 0) {
  LOG.info("No queryConverter defined, using default converter");
  queryConverters.put("queryConverter", new SpellingQueryConverter());
}

//there should only be one
if (queryConverters.size() == 1) {
  queryConverter = queryConverters.values().iterator().next();
  IndexSchema schema = core.getSchema();
  String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType");
  FieldType fieldType = schema.getFieldTypes().get(fieldTypeName);
  Analyzer analyzer = fieldType == null
      ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion)
      : fieldType.getQueryAnalyzer();
  //TODO: There's got to be a better way!  Where's Spring when you need it?
  queryConverter.setAnalyzer(analyzer);
}

No else! And queryConverter is not initialized, except for that code path where 
there was zero or one QueryConverter class.

-- Jack Krupansky
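One defensive fix along the lines James suggests (use the first converter and warn) could look like the following. This is a hypothetical sketch, not the actual SpellCheckComponent code; plain Strings stand in for QueryConverter instances so it runs standalone:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

public class ConverterInit {
    private static final Logger LOG = Logger.getLogger("demo");

    // Instead of silently leaving queryConverter null when several
    // converters are registered, fall back to the first one and warn.
    public static String pickConverter(Map<String, String> queryConverters) {
        if (queryConverters.isEmpty()) {
            LOG.info("No queryConverter defined, using default converter");
            queryConverters.put("queryConverter", "SpellingQueryConverter");
        }
        if (queryConverters.size() > 1) {
            LOG.warning("More than one queryConverter defined; using the first one");
        }
        return queryConverters.values().iterator().next();
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("a", "ConverterA");
        m.put("b", "ConverterB");
        System.out.println(pickConverter(m)); // ConverterA, plus a warning
    }
}
```

Throwing an exception instead of warning would also work; the key point is that the `size() == 1` branch should not be the only place `queryConverter` gets assigned.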

RE: looking for package org.apache.lucene.analysis.standard

2013-01-14 Thread JimAld
Hi,

Well I've managed to fix the issue, sort of, so thought I should summarise
here for any others who stumble across this issue.

The reason maven was not able to build the project is that an empty
constructor for StandardAnalyzer does not exist, even though both Eclipse
and jadclipse showed that it did exist when referencing the 4.0.0 library.
Using the constructor that took the Version as a param fixed this issue and
allowed Maven to build the project.

The Eclipse package explorer now shows no errors, however the Eclipse code
viewer is littered with them. I tried referencing the 3 Lucene jars (shown
above) directly in the classpath and this fixed the errors Eclipse showed
with regards to Lucene, however it introduced a load of new errors with the
rest of the project - can't win! 

Anyhow, I can live without intelli-sense for this project, as long as maven
builds, that's the main thing.

Thanks to everyone for their replies to this post.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033195.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.




[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race

2013-01-14 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552836#comment-13552836
 ] 

Simon Willnauer commented on LUCENE-4676:
-

to visualize this again here is a commented Log from a failure:

{panel}

IW [Thread-623]: getReader took 2 msec
CMS [Lucene Merge Thread #0]:   merge thread: start
TEST [Thread-623]: refresh after delete  {color:red}<== HERE WE REFRESH AFTER THE DEL BY QUERY{color}
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
IW [Thread-623]: nrtIsCurrent: infoVersion matches: true DW changes: true BD 
changes: true
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
IW [Thread-623]: nrtIsCurrent: infoVersion matches: true DW changes: true BD 
changes: true
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
IW [Thread-623]: flush at getReader
DW [Thread-623]: Thread-623 startFullFlush
DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false 
pendingChangesInFullFlush: false
DWFC [Thread-623]: addFlushableState DocumentsWriterPerThread 
[pendingDeletes=gen=0, segment=null, aborting=false, numDocsInRAM=0, 
deleteQueue=DWDQ: [ generation: 3 ]]
DW [Thread-623]: Thread-623: flush naked frozen global deletes  {color:red}<== HERE WE PUSH THE DEL BY QUERY TO THE BUFFERED DELETE STREAM{color}
BD [Thread-623]: push deletes  1 deleted queries bytesUsed=32 delGen=4 
packetCount=2 totBytesUsed=1056
DW [Thread-623]: flush: push buffered deletes:  1 deleted queries bytesUsed=32
BD [Lucene Merge Thread #0]: applyDeletes: infos=[_1(5.0):c1, _0(5.0):C1] 
packetCount=2
BD [Lucene Merge Thread #0]: seg=_1(5.0):c1 segGen=3 coalesced 
deletes=[CoalescedDeletes(termSets=1,queries=1)] newDelCount=0
BD [Lucene Merge Thread #0]: seg=_0(5.0):C1 segGen=1 coalesced 
deletes=[CoalescedDeletes(termSets=2,queries=1)] newDelCount=0
BD [Lucene Merge Thread #0]: applyDeletes took 0 msec  {color:red}<== THE MERGE KICKS IN{color}
BD [Lucene Merge Thread #0]: prune 
sis=org.apache.lucene.index.SegmentInfos@6dfb8d2e minGen=5 packetCount=2
BD [Lucene Merge Thread #0]: pruneDeletes: prune 2 packets; 0 packets remain  {color:red}<== MERGE PRUNES AWAY THE PACKAGE{color}
IW [Lucene Merge Thread #0]: merge seg=_2 _1(5.0):c1 _0(5.0):C1
IW [Lucene Merge Thread #0]: now merge
  merge=_1(5.0):c1 _0(5.0):C1
  index=_0(5.0):C1 _1(5.0):c1
IW [Lucene Merge Thread #0]: merging _1(5.0):c1 _0(5.0):C1
IW [Thread-623]: don't apply deletes now delTermCount=0 bytesUsed=0
IW [Thread-623]: return reader version=6 reader=StandardDirectoryReader(:nrt 
_0(5.0):C1 _1(5.0):c1)
DW [Thread-623]: Thread-623 finishFullFlush success=true
IW [Thread-623]: getReader took 1 msec {color:red}<== HERE WE ARE DONE REFRESHING AFTER THE DELETE -- DEL QUERY IS ALREADY GONE{color}
IW [Lucene Merge Thread #0]: seg=_1(5.0):c1 no deletes
IW [Lucene Merge Thread #0]: seg=_0(5.0):C1 no deletes
TEST 
[TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]:
 done updating
DW 
[TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]:
 anyChanges? numDocsInRam=0 deletes=false hasTickets:false 
pendingChangesInFullFlush: false
IW 
[TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]:
 nrtIsCurrent: infoVersion matches: true DW changes: false BD changes: false 
{color:red}<== HERE WE ARE ASSERTING ON isCurrent == FALSE and FAIL!!{color}
DW 
[TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]:
 anyChanges? numDocsInRam=0 deletes=false hasTickets:false 
pendingChangesInFullFlush: false
DW 
[TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]:
 anyChanges? numDocsInRam=0 deletes=false hasTickets:false 
pendingChangesInFullFlush: false
SM [Lucene Merge Thread #0]: merge store matchedCount=2 vs 2
{panel}

 IndexReader.isCurrent race
 --

 Key: LUCENE-4676
 URL: https://issues.apache.org/jira/browse/LUCENE-4676
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Simon Willnauer
 Fix For: 4.1


 

[JENKINS] Lucene-Solr-trunk-Linux (64bit/jdk1.7.0_10) - Build # 3772 - Failure!

2013-01-14 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/3772/
Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC

All tests passed

Build Log:
[...truncated 25890 lines...]
-documentation-lint:
 [echo] checking for broken html...
[jtidy] Checking for broken html (such as invalid tags)...
   [delete] Deleting directory 
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/build/jtidy_tmp
 [echo] Checking for broken links...
 [exec] 
 [exec] Crawl/parse...
 [exec] 
 [exec] Verify...
 [exec] 
 [exec] 
file:///build/docs/core/org/apache/lucene/analysis/package-summary.html
 [exec]   BAD EXTERNAL LINK: http://lucene.apache.org/core/discussion.html
 [exec] 
 [exec] Broken javadocs links were found!

BUILD FAILED
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:60: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/build.xml:242: The 
following error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/common-build.xml:1961:
 exec returned: 1

Total time: 37 minutes 5 seconds
Build step 'Invoke Ant' marked build as failure
Archiving artifacts
Recording test results
Description set: Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC
Email was triggered for: Failure
Sending email for trigger: Failure




[jira] [Updated] (LUCENE-4676) IndexReader.isCurrent race

2013-01-14 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-4676:


Attachment: LUCENE-4676.patch

here is a patch to fix this test

 IndexReader.isCurrent race
 --

 Key: LUCENE-4676
 URL: https://issues.apache.org/jira/browse/LUCENE-4676
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Simon Willnauer
 Fix For: 4.1

 Attachments: LUCENE-4676.patch


 Revision: 1431169
 ant test  -Dtestcase=TestNRTManager 
 -Dtests.method=testThreadStarvationNoDeleteNRTReader 
 -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA 
 -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII 
 -Dtests.dups=500




[jira] [Created] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.

2013-01-14 Thread Mark Miller (JIRA)
Mark Miller created SOLR-4303:
-

 Summary: On replication, if the generation of the master is lower 
than the slave we need to force a full copy of the index.
 Key: SOLR-4303
 URL: https://issues.apache.org/jira/browse/SOLR-4303
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.1, 5.0


Doesn't affect SolrCloud since it uses the 'force' option, but a regression in 
Solr 4.0 from 3X it appears.




[jira] [Commented] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552888#comment-13552888
 ] 

Commit Tag Bot commented on SOLR-4303:
--

[trunk commit] Mark Robert Miller
http://svn.apache.org/viewvc?view=revisionrevision=1432993

SOLR-4303: On replication, if the generation of the master is lower than the 
slave we need to force a full copy of the index.


 On replication, if the generation of the master is lower than the slave we 
 need to force a full copy of the index.
 --

 Key: SOLR-4303
 URL: https://issues.apache.org/jira/browse/SOLR-4303
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.1, 5.0


 Doesn't affect SolrCloud since it uses the 'force' option, but a regression 
 in Solr 4.0 from 3X it appears.




[jira] [Updated] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.

2013-01-14 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated SOLR-4303:
--

Description: 
Doesn't affect SolrCloud since it uses the 'force' option, but a regression in 
Solr 4.0 from 3X it appears.

See 
http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html

  was:Doesn't affect SolrCloud since it uses the 'force' option, but a 
regression in Solr 4.0 from 3X it appears.


 On replication, if the generation of the master is lower than the slave we 
 need to force a full copy of the index.
 --

 Key: SOLR-4303
 URL: https://issues.apache.org/jira/browse/SOLR-4303
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.1, 5.0


 Doesn't affect SolrCloud since it uses the 'force' option, but a regression 
 in Solr 4.0 from 3X it appears.
 See 
 http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html




[jira] [Commented] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552897#comment-13552897
 ] 

Commit Tag Bot commented on SOLR-4303:
--

[branch_4x commit] Mark Robert Miller
http://svn.apache.org/viewvc?view=revisionrevision=1432995

SOLR-4303: On replication, if the generation of the master is lower than the 
slave we need to force a full copy of the index.


 On replication, if the generation of the master is lower than the slave we 
 need to force a full copy of the index.
 --

 Key: SOLR-4303
 URL: https://issues.apache.org/jira/browse/SOLR-4303
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.1, 5.0


 Doesn't affect SolrCloud since it uses the 'force' option, but a regression 
 in Solr 4.0 from 3X it appears.
 See 
 http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html




[jira] [Resolved] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.

2013-01-14 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved SOLR-4303.
---

Resolution: Fixed

 On replication, if the generation of the master is lower than the slave we 
 need to force a full copy of the index.
 --

 Key: SOLR-4303
 URL: https://issues.apache.org/jira/browse/SOLR-4303
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.1, 5.0


 Doesn't affect SolrCloud since it uses the 'force' option, but a regression 
 in Solr 4.0 from 3X it appears.
 See 
 http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html




[JENKINS] Lucene-Solr-4.x-Windows (64bit/jdk1.7.0_10) - Build # 2397 - Failure!

2013-01-14 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Windows/2397/
Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC

All tests passed

Build Log:
[...truncated 25765 lines...]
-documentation-lint:
 [echo] checking for broken html...
[jtidy] Checking for broken html (such as invalid tags)...
   [delete] Deleting directory 
C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\build\jtidy_tmp
 [echo] Checking for broken links...
 [exec] 
 [exec] Crawl/parse...
 [exec] 
 [exec] Verify...
 [exec] 
 [exec] 
file:///build/docs/core/org/apache/lucene/analysis/package-summary.html
 [exec]   BAD EXTERNAL LINK: http://lucene.apache.org/core/discussion.html
 [exec] 
 [exec] Broken javadocs links were found!

BUILD FAILED
C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\build.xml:60: The 
following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\build.xml:242: 
The following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\common-build.xml:1960:
 exec returned: 1

Total time: 64 minutes 10 seconds
Build step 'Invoke Ant' marked build as failure
Archiving artifacts
Recording test results
Description set: Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC
Email was triggered for: Failure
Sending email for trigger: Failure




Re: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present

2013-01-14 Thread Jack Krupansky
I just tried, and it causes an NPE, kind of as I had expected. I’ll file the 
Jira.

-- Jack Krupansky



[jira] [Created] (LUCENE-4684) Allow DirectSpellChecker to be extended

2013-01-14 Thread Martijn van Groningen (JIRA)
Martijn van Groningen created LUCENE-4684:
-

 Summary: Allow DirectSpellChecker  to be extended
 Key: LUCENE-4684
 URL: https://issues.apache.org/jira/browse/LUCENE-4684
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spellchecker
 Environment: Currently the suggestSimilar() that actually operates on 
the FuzzyTermsEnum is private. It would be great if it were just 
protected, for extension.
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
Priority: Minor







Re: svn commit: r1433005 - in /lucene/dev/branches/branch_4x: ./ dev-tools/ dev-tools/scripts/checkJavadocLinks.py

2013-01-14 Thread Steve Rowe
Thanks Robert. - Steve

On Jan 14, 2013, at 12:41 PM, rm...@apache.org wrote:

 Author: rmuir
 Date: Mon Jan 14 17:41:01 2013
 New Revision: 1433005
 
 URL: http://svn.apache.org/viewvc?rev=1433005view=rev
 Log:
 whitelist this link
 
 Modified:
lucene/dev/branches/branch_4x/   (props changed)
lucene/dev/branches/branch_4x/dev-tools/   (props changed)
lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py
 
 Modified: lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py
 URL: 
 http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py?rev=1433005r1=1433004r2=1433005view=diff
 ==
 --- lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py 
 (original)
 +++ lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py Mon 
 Jan 14 17:41:01 2013
 @@ -197,6 +197,9 @@ def checkAll(dirName):
 elif link.find('lucene.apache.org/java/docs/discussion.html') != -1:
   # OK
   pass
 +elif link.find('lucene.apache.org/core/discussion.html') != -1:
 +  # OK
 +  pass
 elif 
 link.find('lucene.apache.org/solr/mirrors-solr-latest-redir.html') != -1:
   # OK
   pass
 
 





[jira] [Updated] (LUCENE-4684) Allow DirectSpellChecker to be extended

2013-01-14 Thread Martijn van Groningen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-4684:
--

Attachment: LUCENE-4684.patch

Made all the fields of DirectSpellChecker protected, along with the 
suggestSimilar method and the ScoreTerm inner class.

 Allow DirectSpellChecker  to be extended
 

 Key: LUCENE-4684
 URL: https://issues.apache.org/jira/browse/LUCENE-4684
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spellchecker
 Environment: Currently the suggestSimilar() that actually operates on 
 the FuzzyTermsEnum is private. It would be great if it were just 
 protected, for extension.
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
Priority: Minor
 Attachments: LUCENE-4684.patch







[jira] [Updated] (LUCENE-2765) Optimize scanning in DocsEnum

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2765:
---

Fix Version/s: (was: 4.1)
   4.2

 Optimize scanning in DocsEnum
 -

 Key: LUCENE-2765
 URL: https://issues.apache.org/jira/browse/LUCENE-2765
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 4.2

 Attachments: LUCENE-2765.patch, LUCENE-2765.patch


 Similar to LUCENE-2761:
 when we call advance(), after skipping it scans, but this can be optimized 
 better than calling nextDoc() like today
 {noformat}
   // scan for the rest:
   do {
 nextDoc();
   } while (target > doc);
 {noformat}
 in particular, the freq can be skipped (its vInt need not be decoded) and the 
 skipDocs (deletedDocs) don't need to be checked during this scanning.
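The difference can be sketched with a toy model over a decoded int[] block of doc ids; ScanDemo and its methods are hypothetical names, not the codec API:

```java
public class ScanDemo {
    private final int[] docs; // decoded doc ids in the current block
    private int pos = -1;

    public ScanDemo(int[] docs) { this.docs = docs; }

    // Today's approach: repeatedly call the nextDoc() equivalent, which in
    // the real enum also decodes freq and checks deleted docs on every step.
    public int advanceNaive(int target) {
        int doc;
        do { doc = docs[++pos]; } while (target > doc);
        return doc;
    }

    // Optimized scan: compare only doc ids; freq decoding and the
    // skipDocs/liveDocs check happen once, after the loop.
    public int advanceScan(int target) {
        while (docs[++pos] < target) { /* skip freq + deleted-doc work here */ }
        return docs[pos];
    }

    public static void main(String[] args) {
        System.out.println(new ScanDemo(new int[]{1, 5, 9, 12}).advanceScan(6)); // 9
    }
}
```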




[jira] [Updated] (LUCENE-2832) on Windows 64-bit, maybe we should default to a better maxBBufSize in MMapDirectory

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2832:
---

Fix Version/s: (was: 4.1)
   4.2

 on Windows 64-bit, maybe we should default to a better maxBBufSize in 
 MMapDirectory
 ---

 Key: LUCENE-2832
 URL: https://issues.apache.org/jira/browse/LUCENE-2832
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/store
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 4.2

 Attachments: LUCENE-2832.patch


 Currently the default max buffer size for MMapDirectory is 256MB on 32bit and 
 Integer.MAX_VALUE on 64bit:
 {noformat}
 public static final int DEFAULT_MAX_BUFF = Constants.JRE_IS_64BIT ? 
 Integer.MAX_VALUE : (256 * 1024 * 1024);
 {noformat}
 But on 64-bit Windows, you are practically limited to 8TB. This can cause 
 problems in extreme cases, such as: 
 http://www.lucidimagination.com/search/document/7522ee54c46f9ca4/map_failed_at_getsearcher
 Perhaps it would be good to change this default so that it's 256MB on 32-bit 
 *OR* Windows, but leave it at Integer.MAX_VALUE on other 64-bit systems 
 (whose usable address space is 48-bit in practice).
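The practical effect of the buffer size is just how many mapped chunks a given file needs. A small illustrative calculation (hypothetical helper, not MMapDirectory's code):

```java
public class MMapChunks {
    // Number of ByteBuffer chunks needed to map a file, given the
    // maximum size mapped per buffer.
    public static long chunkCount(long fileLength, int maxBBufSize) {
        return (fileLength + maxBBufSize - 1L) / maxBBufSize; // ceiling division
    }

    public static void main(String[] args) {
        long tenGB = 10L * 1024 * 1024 * 1024;
        System.out.println(chunkCount(tenGB, 256 * 1024 * 1024)); // 40 chunks of 256MB
        System.out.println(chunkCount(tenGB, Integer.MAX_VALUE)); // 6 chunks of ~2GB
    }
}
```

Either way the chunk count stays tiny, which is why a smaller default on Windows costs little while avoiding the map-failed errors linked above.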




[jira] [Updated] (SOLR-4016) Deduplication is broken by partial update

2013-01-14 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-4016:


Attachment: SOLR-4016-disallow-partial-update.patch

Patch with a better test.

 Deduplication is broken by partial update
 -

 Key: SOLR-4016
 URL: https://issues.apache.org/jira/browse/SOLR-4016
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.0
 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS
Reporter: Joel Nothman
Assignee: Shalin Shekhar Mangar
  Labels: 4.0.1_Candidate
 Fix For: 4.1, 5.0

 Attachments: SOLR-4016-disallow-partial-update.patch, 
 SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch


 The SignatureUpdateProcessorFactory used (primarily?) for deduplication does 
 not consider partial update semantics.
 The below uses the following solrconfig.xml excerpt:
 {noformat}
 <updateRequestProcessorChain name="text_hash">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">text_hash</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">text</str>
     <str name="signatureClass">solr.processor.TextProfileSignature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
 {noformat}
 Firstly, the processor treats {noformat}{"set": value}{noformat} as a 
 string and hashes it, instead of the value alone:
 {noformat}
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d 
 '{"add":{"doc":{"id": "abcde", "text": {"set": "hello world"}}}}' && curl 
 '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":30}}
 <?xml version="1.0" encoding="UTF-8"?><response><lst 
 name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst 
 name="params"><str name="q">id:abcde</str></lst></lst><result name="response" 
 numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello 
 world</str><str name="text_hash">ad48c7ad60ac22cc</str><long 
 name="_version_">1417247434224959488</long></doc></result>
 </response>
 $
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d 
 '{"add":{"doc":{"id": "abcde", "text": "hello world"}}}' && curl 
 '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":27}}
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int 
 name="QTime">1</int><lst name="params"><str 
 name="q">id:abcde</str></lst></lst><result name="response" numFound="1" 
 start="0"><doc><str name="id">abcde</str><str name="text">hello 
 world</str><str name="text_hash">b169c743d220da8d</str><long 
 name="_version_">141724802221564</long></doc></result>
 </response>
 {noformat}
 Note the different text_hash value.
 Secondly, when updating a field other than those used to create the signature 
 (which I imagine is a more common use-case), the signature is recalculated 
 from no values:
 {noformat}
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d 
 '{"add":{"doc":{"id": "abcde", "title": {"set": "new title"}}}}' && curl 
 '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":39}}
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int 
 name="QTime">1</int><lst name="params"><str 
 name="q">id:abcde</str></lst></lst><result name="response" numFound="1" 
 start="0"><doc><str name="id">abcde</str><str name="text">hello 
 world</str><str name="text_hash"></str><str name="title">new 
 title</str><long name="_version_">1417248120480202752</long></doc></result>
 </response>
 {noformat}
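
The first symptom above is the processor hashing the atomic-update map's string form instead of the inner value. A hypothetical way to avoid that is to unwrap "set" values before signing; this is an illustrative sketch only, not the committed fix (which instead disallows partial updates of signature-generating fields), and the helper name is invented.

```java
import java.util.Map;

public class AtomicUpdateValues {
    // Hypothetical helper: if a field value is an atomic-update map like
    // {"set": "hello world"}, return the inner value so the signature is
    // computed over it rather than over the map's string form.
    static Object unwrapSet(Object fieldValue) {
        if (fieldValue instanceof Map) {
            Object inner = ((Map<?, ?>) fieldValue).get("set");
            if (inner != null) {
                return inner;
            }
        }
        return fieldValue;
    }

    public static void main(String[] args) {
        // Both forms now contribute the same bytes to the signature.
        System.out.println(unwrapSet(Map.of("set", "hello world")));
        System.out.println(unwrapSet("hello world"));
    }
}
```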




[jira] [Updated] (SOLR-2570) randomize indexwriter settings in solr tests

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated SOLR-2570:
-

Fix Version/s: (was: 4.1)
   4.2

 randomize indexwriter settings in solr tests
 

 Key: SOLR-2570
 URL: https://issues.apache.org/jira/browse/SOLR-2570
 Project: Solr
  Issue Type: Test
  Components: Build
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 4.2

 Attachments: SOLR-2570.patch


 we should randomize indexwriter settings like lucene tests do, to vary # of 
 segments and such.
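
The kind of randomization meant here can be sketched with a plain seeded Random (the knob names and ranges below are hypothetical; Lucene's tests draw from LuceneTestCase's shared random() so failures reproduce from a logged seed):

```java
import java.util.Random;

public class RandomIwSettings {
    // Hypothetical ranges: each test run picks different IndexWriter knobs,
    // which varies flush points and therefore the number of segments.
    static int randomMaxBufferedDocs(Random r) {
        return 2 + r.nextInt(99); // 2..100 docs buffered before a flush
    }

    static double randomRamBufferMB(Random r) {
        return 1 + r.nextDouble() * 63; // 1..64 MB RAM buffer
    }

    public static void main(String[] args) {
        Random r = new Random(42); // fixed seed makes the run reproducible
        System.out.println(randomMaxBufferedDocs(r));
        System.out.println(randomRamBufferMB(r));
    }
}
```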




[jira] [Updated] (SOLR-1674) improve analysis tests, cut over to new API

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated SOLR-1674:
-

Fix Version/s: (was: 4.1)
   4.2

 improve analysis tests, cut over to new API
 ---

 Key: SOLR-1674
 URL: https://issues.apache.org/jira/browse/SOLR-1674
 Project: Solr
  Issue Type: Test
  Components: Schema and Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 4.2

 Attachments: SOLR-1674.patch, SOLR-1674.patch, SOLR-1674_speedup.patch


 This patch
 * converts all analysis tests to use the new tokenstream api
 * converts most tests to use the more stringent assertion mechanisms from 
 lucene
 * adds new tests to improve coverage
 Most bugs found by more stringent testing have been fixed, with the exception 
 of SynonymFilter.
 The problems with this filter are more serious, the previous tests were 
 essentially a no-op.
 The new tests for SynonymFilter test the current behavior, but have FIXMEs 
 with what I think the old test wanted to expect in the comments.




[jira] [Updated] (LUCENE-3459) Change ChainedFilter to use FixedBitSet

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3459:
---

Fix Version/s: (was: 4.1)
   4.2

 Change ChainedFilter to use FixedBitSet
 ---

 Key: LUCENE-3459
 URL: https://issues.apache.org/jira/browse/LUCENE-3459
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/other
Affects Versions: 3.4, 4.0-ALPHA
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.2


 ChainedFilter also uses OpenBitSet(DISI) at the moment. It should also be 
 changed to use FixedBitSet. There are two issues:
 - It sometimes exposes OpenBitSetDISI in its public API - we should remove 
 those methods like in BooleanFilter and break backwards compatibility
 - It allows an XOR operation. This is not yet supported by FixedBitSet, but 
 it's easy to add (like for BooleanFilter). On the other hand, this XOR 
 operation is bogus, as it may mark documents in the BitSet that are deleted, 
 breaking new features like applying Filters down-low (LUCENE-1536). We should 
 maybe remove the XOR operation, or force it to use IR.validDocs() (trunk) or 
 IR.isDeleted().
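
The xor operation itself is trivial on a word-aligned bit set; a minimal standalone sketch (not FixedBitSet's actual API, and real code would still have to guard against marking deleted documents, as noted in the issue):

```java
public class XorBitsSketch {
    // Word-by-word xor over the backing long[] arrays, the operation the
    // issue suggests adding to FixedBitSet (like the existing and/or).
    static void xor(long[] dest, long[] other) {
        int n = Math.min(dest.length, other.length);
        for (int i = 0; i < n; i++) {
            dest[i] ^= other[i];
        }
    }

    public static void main(String[] args) {
        long[] a = { 0b1100L };
        long[] b = { 0b1010L };
        xor(a, b); // bits set in exactly one input survive
        System.out.println(Long.toBinaryString(a[0])); // 110
    }
}
```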




[jira] [Updated] (LUCENE-3034) If you vary a setting per round and that setting is a long string, the report padding/columns break down.

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3034:
---

Fix Version/s: (was: 4.1)
   4.2

 If you vary a setting per round and that setting is a long string, the report 
 padding/columns break down.
 -

 Key: LUCENE-3034
 URL: https://issues.apache.org/jira/browse/LUCENE-3034
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/benchmark
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Trivial
 Fix For: 4.2


 This is especially noticeable if you vary a setting where the value is a 
 fully specified class name - in this case, it would be nice if columns in 
 each row still lined up.




[jira] [Updated] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3451:
---

Fix Version/s: (was: 4.1)
   4.2

 Remove special handling of pure negative Filters in BooleanFilter, disallow 
 pure negative queries in BooleanQuery
 -

 Key: LUCENE-3451
 URL: https://issues.apache.org/jira/browse/LUCENE-3451
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.2

 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, 
 LUCENE-3451.patch, LUCENE-3451.patch


 We should at least in Lucene 4.0 remove the hack in BooleanFilter that allows 
 pure negative Filter clauses. This is not supported by BooleanQuery and 
 confuses users (I think that's the problem in LUCENE-3450).
 The hack is buggy, as it does not respect deleted documents and returns them 
 in its DocIdSet.
 Also we should think about disallowing pure-negative Queries at all and throw 
 UOE.




[jira] [Updated] (LUCENE-3968) Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3968:
---

Fix Version/s: (was: 4.1)
   4.2

 Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens
 -

 Key: LUCENE-3968
 URL: https://issues.apache.org/jira/browse/LUCENE-3968
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.2

 Attachments: LUCENE-3968.patch


 MockGraphTokenFilter is rather hairy... I've managed to simplify it (I 
 think!) by breaking apart its two functions...
 I think LookaheadTokenFilter can be used in the future for other graph aware 
 filters.




[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552935#comment-13552935
 ] 

Commit Tag Bot commented on SOLR-4016:
--

[trunk commit] Shalin Shekhar Mangar
http://svn.apache.org/viewvc?view=revisionrevision=1433013

SOLR-4016: Deduplication does not work with atomic/partial updates so disallow 
atomic update requests which change signature generating fields.






[jira] [Created] (SOLR-4304) NPE in Solr SpellCheckComponent if more than one QueryConverter

2013-01-14 Thread Jack Krupansky (JIRA)
Jack Krupansky created SOLR-4304:


 Summary: NPE in Solr SpellCheckComponent if more than one 
QueryConverter
 Key: SOLR-4304
 URL: https://issues.apache.org/jira/browse/SOLR-4304
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Affects Versions: 4.0
Reporter: Jack Krupansky


The Solr SpellCheckComponent uses only a single QueryConverter, but fails with 
an NPE if more than one QueryConverter class is registered in solrconfig.xml.

Repro:

1. Add to 4.0 example solrconfig.xml:

<queryConverter name="myQueryConverter-1" class="solr.SpellingQueryConverter"/>
<queryConverter name="myQueryConverter-2" class="solr.SuggestQueryConverter"/>

2. Perform a spellcheck request:

curl "http://localhost:8983/solr/spell?q=test&indent=true"

3. Examine the NPE:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">500</int>
  <int name="QTime">4</int>
</lst>
<result name="response" numFound="0" start="0">
</result>
<lst name="error">
  <str name="trace">java.lang.NullPointerException
at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:136)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Unknown Source)
</str>
  <int name="code">500</int>
</lst>
</response>

Suggested resolution: Use the first QueryConverter, but give a warning that 
indicates the class name of the one being used. Alternatively, throw a nasty 
but informative exception indicating the true nature of the problem.
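
The first suggested resolution could look roughly like this. The names are hypothetical and the map is a stand-in for the registered converters; this is a sketch of the idea, not SpellCheckComponent code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstConverterPick {
    // "Use the first QueryConverter, but give a warning": take the first
    // registered entry instead of failing with an NPE, and record which
    // one was chosen when more than one is configured.
    static <T> T pickFirst(Map<String, T> converters, StringBuilder log) {
        if (converters.isEmpty()) {
            return null;
        }
        Map.Entry<String, T> first = converters.entrySet().iterator().next();
        if (converters.size() > 1) {
            log.append("Multiple QueryConverters registered; using ")
               .append(first.getKey());
        }
        return first.getValue();
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("myQueryConverter-1", "SpellingQueryConverter");
        m.put("myQueryConverter-2", "SuggestQueryConverter");
        StringBuilder log = new StringBuilder();
        System.out.println(pickFirst(m, log)); // SpellingQueryConverter
        System.out.println(log);
    }
}
```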






[jira] [Updated] (LUCENE-3424) Return sequence ids from IW update/delete/add/commit to allow total ordering outside of IW

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3424:
---

Fix Version/s: (was: 4.1)
   4.2

 Return sequence ids from IW update/delete/add/commit to allow total ordering 
 outside of IW
 --

 Key: LUCENE-3424
 URL: https://issues.apache.org/jira/browse/LUCENE-3424
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.2

 Attachments: LUCENE-3424.patch


 Based on the discussion on the [mailing 
 list|http://mail-archives.apache.org/mod_mbox/lucene-dev/201109.mbox/%3CCAAHmpki-h7LUZGCUX_rfFx=q5-YkLJei+piRG=oic8d1pnr...@mail.gmail.com%3E]
  IW should return sequence ids from update/delete/add and commit to allow 
 ordering of events for consistent transaction logs and recovery.
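
The contract being asked for can be illustrated with a single atomic counter (a toy model of the API shape, not IndexWriter's eventual implementation):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SeqNoSketch {
    // Every add/update/delete/commit returns a strictly increasing id, so
    // callers can totally order operations for transaction logs and recovery.
    private final AtomicLong seqNo = new AtomicLong();

    long addDocument()    { return seqNo.incrementAndGet(); }
    long deleteDocument() { return seqNo.incrementAndGet(); }
    long commit()         { return seqNo.incrementAndGet(); }

    public static void main(String[] args) {
        SeqNoSketch iw = new SeqNoSketch();
        System.out.println(iw.addDocument());    // 1
        System.out.println(iw.deleteDocument()); // 2
        System.out.println(iw.commit());         // 3
    }
}
```

AtomicLong keeps the ordering consistent even when multiple indexing threads interleave, which is exactly why a plain long field would not suffice.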




[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update

2013-01-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552950#comment-13552950
 ] 

Yonik Seeley commented on SOLR-4016:


bq. I see why you suggested that. The signature is like a unique key and 
modifying it seems like a rare use-case. But, if we do go that way, we should 
throw an exception and explicitly disallow partial update of signature 
generating fields.

There are different use-cases here.  If the signature being generated was the 
unique key, then atomic updates should be able to proceed fine as long as the 
id field is specified (as should always be the case with atomic updates).





[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3069:
---

Fix Version/s: (was: 4.1)
   4.2

 Lucene should have an entirely memory resident term dictionary
 --

 Key: LUCENE-3069
 URL: https://issues.apache.org/jira/browse/LUCENE-3069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index, core/search
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Simon Willnauer
  Labels: gsoc2012, lucene-gsoc-12
 Fix For: 4.2


 FST based TermDictionary has been a great improvement yet it still uses a 
 delta codec file for scanning to terms. Some environments have enough memory 
 available to keep the entire FST based term dict in memory. We should add a 
 TermDictionary implementation that encodes all needed information for each 
 term into the FST (custom fst.Output) and builds a FST from the entire term 
 not just the delta.




[jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3022:
---

Fix Version/s: (was: 4.1)
   4.2

 DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
 -

 Key: LUCENE-3022
 URL: https://issues.apache.org/jira/browse/LUCENE-3022
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 2.9.4, 3.1
Reporter: Johann Höchtl
Assignee: Robert Muir
Priority: Minor
 Fix For: 4.2

 Attachments: LUCENE-3022.patch, LUCENE-3022.patch

   Original Estimate: 5m
  Remaining Estimate: 5m

 When using the DictionaryCompoundWordTokenFilter with a German dictionary, I 
 got a strange behaviour:
 The German word streifenbluse (blouse with stripes) was decompounded to 
 streifen (stripe), reifen (tire), which makes no sense at all.
 I thought the flag onlyLongestMatch would fix this, because streifen is 
 longer than reifen, but it had no effect.
 So I reviewed the source code and found the problem:
 [code]
 protected void decomposeInternal(final Token token) {
   // Only words longer than minWordSize get processed
   if (token.length() < this.minWordSize) {
     return;
   }

   char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.buffer());

   for (int i = 0; i < token.length() - this.minSubwordSize; ++i) {
     Token longestMatchToken = null;
     for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
       if (i + j > token.length()) {
         break;
       }
       if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
         if (this.onlyLongestMatch) {
           if (longestMatchToken != null) {
             if (longestMatchToken.length() < j) {
               longestMatchToken = createToken(i, j, token);
             }
           } else {
             longestMatchToken = createToken(i, j, token);
           }
         } else {
           tokens.add(createToken(i, j, token));
         }
       }
     }
     if (this.onlyLongestMatch && longestMatchToken != null) {
       tokens.add(longestMatchToken);
     }
   }
 }
 [/code]
 should be changed to 
 [code]
 protected void decomposeInternal(final Token token) {
   // Only words longer than minWordSize get processed
   if (token.termLength() < this.minWordSize) {
     return;
   }
   char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.termBuffer());
   Token longestMatchToken = null;
   for (int i = 0; i < token.termLength() - this.minSubwordSize; ++i) {
     for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
       if (i + j > token.termLength()) {
         break;
       }
       if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
         if (this.onlyLongestMatch) {
           if (longestMatchToken != null) {
             if (longestMatchToken.termLength() < j) {
               longestMatchToken = createToken(i, j, token);
             }
           } else {
             longestMatchToken = createToken(i, j, token);
           }
         } else {
           tokens.add(createToken(i, j, token));
         }
       }
     }
   }
   if (this.onlyLongestMatch && longestMatchToken != null) {
     tokens.add(longestMatchToken);
   }
 }
 [/code]
 so that only the longest token is really indexed and the onlyLongestMatch 
 flag makes sense.




[jira] [Updated] (LUCENE-3797) 3xCodec should throw UOE if a DocValuesConsumer is pulled

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3797:
---

Fix Version/s: (was: 4.1)
   4.2

 3xCodec should throw UOE if a DocValuesConsumer is pulled 
 --

 Key: LUCENE-3797
 URL: https://issues.apache.org/jira/browse/LUCENE-3797
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs, core/index
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.2

 Attachments: LUCENE-3797.patch, LUCENE-3797.patch


 Currently we just return null if a DVConsumer is pulled from 3.x, which is 
 trappy since it causes an NPE in DocFieldProcessor. We should rather throw a 
 UOE.




[jira] [Updated] (LUCENE-3252) Use single array in fixed straight bytes DocValues if possible

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-3252:
---

Fix Version/s: (was: 4.1)
   4.2

 Use single array in fixed straight bytes DocValues if possible
 --

 Key: LUCENE-3252
 URL: https://issues.apache.org/jira/browse/LUCENE-3252
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search, core/store
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.2

 Attachments: LUCENE-3252.patch


 FixedStraightBytesImpl currently uses a straight array only if the byte size 
 is 1 per document; we could further optimize this to use a single array if all 
 the values fit in.
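
The single-array layout being proposed is plain fixed-width addressing: the value for a document lives at a computed offset. A self-contained sketch under that assumption (not Lucene's actual implementation):

```java
public class FixedStraightBytesSketch {
    // One backing array, 'size' bytes per document: docID's value occupies
    // the slice starting at docID * size. No per-document objects needed.
    static byte[] valueFor(byte[] data, int size, int docID) {
        byte[] out = new byte[size];
        System.arraycopy(data, docID * size, out, 0, size);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = { 1, 2, 3, 4, 5, 6 }; // 3 docs, 2 bytes each
        byte[] doc1 = valueFor(data, 2, 1);
        System.out.println(doc1[0] + "," + doc1[1]); // 3,4
    }
}
```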




[jira] [Updated] (LUCENE-1921) Absurdly large radius (miles) search fails to include entire earth

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-1921:
---

Fix Version/s: (was: 4.1)
   4.2

 Absurdly large radius (miles) search fails to include entire earth
 --

 Key: LUCENE-1921
 URL: https://issues.apache.org/jira/browse/LUCENE-1921
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/spatial
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Chris Male
Priority: Minor
 Fix For: 4.2

 Attachments: ASF.LICENSE.NOT.GRANTED--TEST-1921.patch


 Spinoff from LUCENE-1781.
 If you do a very large (eg 10 miles) radius search then the
 lat/lng bound box wraps around the entire earth and all points should
 be accepted.  But this fails today (many points are rejected).  It's
 easy to see the issue: edit TestCartesian, and insert a very large
 miles into either testRange or testGeoHashRange.




[jira] [Updated] (LUCENE-1710) Add byte/short to NumericUtils, NumericField and NumericRangeQuery

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-1710:
---

Fix Version/s: (was: 4.1)
   4.2

 Add byte/short to NumericUtils, NumericField and NumericRangeQuery
 --

 Key: LUCENE-1710
 URL: https://issues.apache.org/jira/browse/LUCENE-1710
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/index, core/search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 4.2


 Although NumericRangeQuery will not profit much from trie-encoding short/byte 
 fields (byte fields with e.g. precisionStep 8 would only create one 
 precision), it may be good to have these two data types available with 
 NumericField to be generally able to store them in prefix-encoded form in the 
 index.
 This is important for loading them into FieldCache where they require much 
 less memory.




[jira] [Updated] (LUCENE-2527) FieldCache.getTermsIndex should cache fasterButMoreRAM=true|false to the same cache key

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2527:
---

Fix Version/s: (was: 4.1)
   4.2

 FieldCache.getTermsIndex should cache fasterButMoreRAM=true|false to the same 
 cache key
 ---

 Key: LUCENE-2527
 URL: https://issues.apache.org/jira/browse/LUCENE-2527
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 4.0-ALPHA
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.2


 When we cutover FieldCache to use shared byte[] blocks, we added the boolean 
 fasterButMoreRAM option, so you could tradeoff time/space.
 It defaults to true.
 The thinking is that an expert user, who wants to use false, could 
 pre-populate FieldCache by loading the field with false, and then later when 
 sorting on that field it'd use that same entry.
 But there's a bug -- when sorting, it then loads a 2nd entry with true.  
 This is because the Entry.custom in FieldCache participates in 
 equals/hashCode.
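The identity bug described above can be pictured with a stand-alone sketch (toy classes of our own, not Lucene's actual FieldCache internals): if the cache key's equals/hashCode deliberately exclude the time/space tuning flag, a pre-populated entry loaded with fasterButMoreRAM=false is reused when sorting later asks with true.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheKeyDemo {
    static final class Entry {
        final Object reader;            // identifies the segment/reader
        final String field;             // field name
        final boolean fasterButMoreRAM; // tuning flag: stored, but NOT part of identity

        Entry(Object reader, String field, boolean fasterButMoreRAM) {
            this.reader = reader;
            this.field = field;
            this.fasterButMoreRAM = fasterButMoreRAM;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Entry)) return false;
            Entry e = (Entry) o;
            // tuning flag intentionally excluded, unlike a key whose custom
            // object participates in equals/hashCode
            return reader == e.reader && field.equals(e.field);
        }

        @Override public int hashCode() {
            return System.identityHashCode(reader) ^ field.hashCode();
        }
    }

    // Pre-populate with false, then look up with true: one shared entry.
    public static int cachedEntries(Object reader, String field) {
        Map<Entry, int[]> cache = new HashMap<Entry, int[]>();
        cache.put(new Entry(reader, field, false), new int[0]);
        if (!cache.containsKey(new Entry(reader, field, true))) {
            cache.put(new Entry(reader, field, true), new int[0]); // buggy path would land here
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(cachedEntries(new Object(), "title")); // 1, not 2
    }
}
```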




[jira] [Updated] (LUCENE-1820) WildcardQueryNode to expose the positions of the wildcard characters, for easier use in processors and builders

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-1820:
---

Fix Version/s: (was: 4.1)
   4.2

 WildcardQueryNode to expose the positions of the wildcard characters, for 
 easier use in processors and builders
 ---

 Key: LUCENE-1820
 URL: https://issues.apache.org/jira/browse/LUCENE-1820
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/queryparser
Reporter: Luis Alves
Assignee: Michael Busch
Priority: Minor
 Fix For: 4.2


 Change the WildcardQueryNode to expose the positions of the wildcard 
 characters.
 This would allow the AllowLeadingWildcardProcessor not to need knowledge of 
 the wildcard chars * and ? and to avoid double-checking them.




[jira] [Updated] (LUCENE-2735) First Cut at GroupVarInt with FixedIntBlockIndexInput / Output

2013-01-14 Thread Steve Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-2735:
---

Fix Version/s: (was: 4.1)
   4.2

 First Cut at GroupVarInt with FixedIntBlockIndexInput / Output
 --

 Key: LUCENE-2735
 URL: https://issues.apache.org/jira/browse/LUCENE-2735
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 4.0-ALPHA
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.2

 Attachments: LUCENE-2735_alt.patch, LUCENE-2735.patch, 
 LUCENE-2735.patch, LUCENE-2735.patch


 I have hacked together a FixedIntBlockIndex impl with Group VarInt encoding - 
 this does way worse than standard codec in benchmarks but I guess that is 
 mainly due to the FixedIntBlockIndex limitations. Once LUCENE-2723 is in / or 
 builds with trunk again I will update and run some tests. The isolated 
 microbenchmark shows that there could be improvements over vint even in java 
 though and I am sure we can make it faster impl. wise.
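For readers unfamiliar with the scheme, here is a minimal stand-alone sketch of Group VarInt coding (our own helper, not the attached patch): each group of four ints gets one selector byte holding four 2-bit byte lengths, followed by the 1-4 little-endian bytes of each value.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class GroupVarInt {
    static int byteLen(int v) {
        if ((v >>> 8) == 0) return 1;
        if ((v >>> 16) == 0) return 2;
        if ((v >>> 24) == 0) return 3;
        return 4;
    }

    // Encode ints in groups of 4 (length must be a multiple of 4).
    public static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < values.length; i += 4) {
            int sel = 0;
            for (int j = 0; j < 4; j++) sel |= (byteLen(values[i + j]) - 1) << (j * 2);
            out.write(sel); // selector byte: four 2-bit lengths
            for (int j = 0; j < 4; j++) {
                int v = values[i + j];
                int len = byteLen(v);
                for (int b = 0; b < len; b++) out.write((v >>> (8 * b)) & 0xFF);
            }
        }
        return out.toByteArray();
    }

    public static int[] decode(byte[] bytes, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i += 4) {
            int sel = bytes[pos++] & 0xFF;
            for (int j = 0; j < 4; j++) {
                int len = ((sel >>> (j * 2)) & 3) + 1;
                int v = 0;
                for (int b = 0; b < len; b++) v |= (bytes[pos++] & 0xFF) << (8 * b);
                values[i + j] = v;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        int[] in = {3, 260, 70000, 16777216};
        System.out.println(Arrays.equals(in, decode(encode(in), in.length))); // true
    }
}
```

The selector byte is what lets a decoder avoid the per-byte continuation-bit branch of plain vInt, which is where the microbenchmark gains would come from.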




[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update

2013-01-14 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552967#comment-13552967
 ] 

Shalin Shekhar Mangar commented on SOLR-4016:
-

bq. If the signature being generated was the unique key, then atomic updates 
should be able to proceed fine as long as the id field is specified (as should 
always be the case with atomic updates).

The patch that I committed throws an exception if an atomic update request 
contains fields that are used to compute the signature. An atomic update 
request which does not modify the signature proceeds as normal. This way we 
make sure that a document never contains a wrong signature.

Do you agree that this is an acceptable compromise until a proper fix is in 
place?
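The committed guard can be pictured with a tiny stand-in (our own names, not Solr's actual processor code): reject an atomic update whose updated field set intersects the signature fields.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SignatureGuard {
    // Returns false when the update would touch a signature source field,
    // i.e. when the stored signature could become stale.
    public static boolean allowed(Set<String> signatureFields, Set<String> updatedFields) {
        for (String f : updatedFields) {
            if (signatureFields.contains(f)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> sig = new HashSet<String>(Arrays.asList("text"));
        System.out.println(allowed(sig, new HashSet<String>(Arrays.asList("title")))); // true
        System.out.println(allowed(sig, new HashSet<String>(Arrays.asList("text"))));  // false
    }
}
```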

 Deduplication is broken by partial update
 -

 Key: SOLR-4016
 URL: https://issues.apache.org/jira/browse/SOLR-4016
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.0
 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS
Reporter: Joel Nothman
Assignee: Shalin Shekhar Mangar
  Labels: 4.0.1_Candidate
 Fix For: 4.1, 5.0

 Attachments: SOLR-4016-disallow-partial-update.patch, 
 SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch


 The SignatureUpdateProcessorFactory used (primarily?) for deduplication does 
 not consider partial update semantics.
 The below uses the following solrconfig.xml excerpt:
 {noformat}
  <updateRequestProcessorChain name="text_hash">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">text_hash</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">text</str>
      <str name="signatureClass">solr.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
 {noformat}
 Firstly, the processor treats {noformat}{"set": value}{noformat} as a 
 string and hashes it, instead of the value alone:
 {noformat}
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "abcde", "text": {"set": "hello world"}}}}' && curl '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":30}}
 <?xml version="1.0" encoding="UTF-8"?><response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">ad48c7ad60ac22cc</str><long name="_version_">1417247434224959488</long></doc></result>
 </response>
 $
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "abcde", "text": "hello world"}}}' && curl '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":27}}
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">b169c743d220da8d</str><long name="_version_">141724802221564</long></doc></result>
 </response>
 {noformat}
 Note the different text_hash value.
 Secondly, when updating a field other than those used to create the signature 
 (which I imagine is a more common use-case), the signature is recalculated 
 from no values:
 {noformat}
 $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "abcde", "title": {"set": "new title"}}}}' && curl '$URL/select?q=id:abcde'
 {"responseHeader":{"status":0,"QTime":39}}
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash"></str><str name="title">new title</str><long name="_version_">1417248120480202752</long></doc></result>
 </response>
 {noformat}




Re: 4.1 branch

2013-01-14 Thread Steve Rowe
branches/lucene_solr_4_1/ is open for business!

I'm going to change version strings in branch_4x from 4.1 to 4.2 now.

Steve

On Jan 14, 2013, at 10:13 AM, Steve Rowe sar...@gmail.com wrote:

 For anyone with pending patches: I plan on branching for 4.1 at around 1:00pm 
 US EST.





[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552968#comment-13552968
 ] 

Commit Tag Bot commented on LUCENE-3298:


[trunk commit] Michael McCandless
http://svn.apache.org/viewvc?view=revision&revision=1433026

LUCENE-3298: FSTs can now be larger than 2GB, have more than 2B nodes


 FST has hard limit max size of 2.1 GB
 -

 Key: LUCENE-3298
 URL: https://issues.apache.org/jira/browse/LUCENE-3298
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, 
 LUCENE-3298.patch


 The FST uses a single contiguous byte[] under the hood, which in java is 
 indexed by int so we cannot grow this over Integer.MAX_VALUE.  It also 
 internally encodes references to this array as vInt.
 We could switch this to a paged byte[] and make the maximum size far larger.
 But I think this is low priority... I'm not going to work on it any time soon.
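The paged byte[] idea can be sketched stand-alone (a toy of our own, not the committed FST code): fixed power-of-two pages, so a position is a long split into a page index and an offset within the page.

```java
import java.util.ArrayList;
import java.util.List;

public class PagedBytes {
    private static final int PAGE_BITS = 15;             // 32 KB pages
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final int PAGE_MASK = PAGE_SIZE - 1;

    private final List<byte[]> pages = new ArrayList<byte[]>();
    private long length;

    public void append(byte b) {
        int page = (int) (length >>> PAGE_BITS);
        if (page == pages.size()) pages.add(new byte[PAGE_SIZE]); // grow lazily
        pages.get(page)[(int) (length & PAGE_MASK)] = b;
        length++;
    }

    // long index: addressable beyond Integer.MAX_VALUE, unlike a single byte[]
    public byte get(long pos) {
        return pages.get((int) (pos >>> PAGE_BITS))[(int) (pos & PAGE_MASK)];
    }

    public long length() { return length; }

    public static void main(String[] args) {
        PagedBytes pb = new PagedBytes();
        for (int i = 0; i < 100000; i++) pb.append((byte) i);
        System.out.println(pb.get(70000L) == (byte) 70000); // true
    }
}
```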




[jira] [Commented] (SOLR-4304) NPE in Solr SpellCheckComponent if more than one QueryConverter

2013-01-14 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552969#comment-13552969
 ] 

Jack Krupansky commented on SOLR-4304:
--

The issue is in this code:

{code}
  Map<String, QueryConverter> queryConverters = new HashMap<String, QueryConverter>();
  core.initPlugins(queryConverters, QueryConverter.class);

  //ensure that there is at least one query converter defined
  if (queryConverters.size() == 0) {
    LOG.info("No queryConverter defined, using default converter");
    queryConverters.put("queryConverter", new SpellingQueryConverter());
  }

  //there should only be one
  if (queryConverters.size() == 1) {
    queryConverter = queryConverters.values().iterator().next();
    IndexSchema schema = core.getSchema();
    String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType");
    FieldType fieldType = schema.getFieldTypes().get(fieldTypeName);
    Analyzer analyzer = fieldType == null
        ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion)
        : fieldType.getQueryAnalyzer();
    //TODO: There's got to be a better way!  Where's Spring when you need it?
    queryConverter.setAnalyzer(analyzer);
  }
{code}

No else! And queryConverter is not initialized, except for that code path where 
there was zero or one QueryConverter class.


 NPE in Solr SpellCheckComponent if more than one QueryConverter
 ---

 Key: SOLR-4304
 URL: https://issues.apache.org/jira/browse/SOLR-4304
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Affects Versions: 4.0
Reporter: Jack Krupansky

 The Solr SpellCheckComponent uses only a single QueryConverter, but fails 
 with an NPE if more than one QueryConverter class is registered in 
 solrconfig.xml.
 Repro:
 1. Add to 4.0 example solrconfig.xml:
 <queryConverter name="myQueryConverter-1" 
 class="solr.SpellingQueryConverter"/>
 <queryConverter name="myQueryConverter-2" class="solr.SuggestQueryConverter"/>
 2. Perform a spellcheck request:
 curl "http://localhost:8983/solr/spell?q=test&indent=true"
 3. Examine the NPE:
 <?xml version="1.0" encoding="UTF-8"?>
 <response>
 <lst name="responseHeader">
   <int name="status">500</int>
   <int name="QTime">4</int>
 </lst>
 <result name="response" numFound="0" start="0">
 </result>
 <lst name="error">
   <str name="trace">java.lang.NullPointerException
 at 
 org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:136)
 at 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at 
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
 at 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
 at 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
 at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
 at 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
 at 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
 at 
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
 at 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
 at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
 at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
 at 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
 at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
 at org.eclipse.jetty.server.Server.handle(Server.java:351)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
 at 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
 at 
 

[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4620:
---

Attachment: LUCENE-4620.patch

Patch, fixing that bug Shai found.

Performance is better with this specialization:
{noformat}
                Task    QPS base      StdDev    QPS comp      StdDev        Pct diff
            PKLookup      192.61      (4.5%)      193.06      (4.2%)    0.2% (  -8% -    9%)
             LowTerm       15.33      (1.6%)       15.44      (2.5%)    0.7% (  -3% -    4%)
             MedTerm        7.60      (0.7%)        7.74      (1.8%)    1.9% (   0% -    4%)
            HighTerm        3.85      (0.6%)        3.97      (1.2%)    3.1% (   1% -    4%)
{noformat}

I also tried unrolling the vInt loop but perf was strangely quite a bit 
worse..

 Explore IntEncoder/Decoder bulk API
 ---

 Key: LUCENE-4620
 URL: https://issues.apache.org/jira/browse/LUCENE-4620
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, 
 LUCENE-4620.patch, LUCENE-4620.patch


 Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
 and decode(int). Originally, we believed that this layer can be useful for 
 other scenarios, but in practice it's used only for writing/reading the 
 category ordinals from payload/DV.
 Therefore, Mike and I would like to explore a bulk API, something like 
 encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
 can still be streaming (as we don't know in advance how many ints will be 
 written), dunno. Will figure this out as we go.
 One thing to check is whether the bulk API can work w/ e.g. facet 
 associations, which can write arbitrary byte[], and so may decoding to an 
 IntsRef won't make sense. This too we'll figure out as we go. I don't rule 
 out that associations will use a different bulk API.
 At the end of the day, the requirement is for someone to be able to configure 
 how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
 etc.) and later read, with as little overhead as possible.




[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552978#comment-13552978
 ] 

Michael McCandless commented on LUCENE-4620:


I think we should just make a specialized accumulator/aggregator for the 
counts-only-dgap-vint case: that code wouldn't need to populate an IntsRef and 
then make a 2nd pass over the ords ... it'd just increment the count for each 
ord as it decodes.  In previous issues I already tested that this gives a good 
gain ...


 Explore IntEncoder/Decoder bulk API
 ---

 Key: LUCENE-4620
 URL: https://issues.apache.org/jira/browse/LUCENE-4620
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, 
 LUCENE-4620.patch, LUCENE-4620.patch


 Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
 and decode(int). Originally, we believed that this layer can be useful for 
 other scenarios, but in practice it's used only for writing/reading the 
 category ordinals from payload/DV.
 Therefore, Mike and I would like to explore a bulk API, something like 
 encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
 can still be streaming (as we don't know in advance how many ints will be 
 written), dunno. Will figure this out as we go.
 One thing to check is whether the bulk API can work w/ e.g. facet 
 associations, which can write arbitrary byte[], and so may decoding to an 
 IntsRef won't make sense. This too we'll figure out as we go. I don't rule 
 out that associations will use a different bulk API.
 At the end of the day, the requirement is for someone to be able to configure 
 how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
 etc.) and later read, with as little overhead as possible.




[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552994#comment-13552994
 ] 

Michael McCandless commented on LUCENE-4676:


This explanation makes perfect sense!  Thanks for digging Simon.  +1 to just 
use NoMergePolicy.

 IndexReader.isCurrent race
 --

 Key: LUCENE-4676
 URL: https://issues.apache.org/jira/browse/LUCENE-4676
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Simon Willnauer
 Fix For: 4.1

 Attachments: LUCENE-4676.patch


 Revision: 1431169
 ant test  -Dtestcase=TestNRTManager 
 -Dtests.method=testThreadStarvationNoDeleteNRTReader 
 -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA 
 -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII 
 -Dtests.dups=500




Re: 4.1 branch

2013-01-14 Thread Steve Rowe

On Jan 14, 2013, at 1:33 PM, Steve Rowe sar...@gmail.com wrote:

 branches/lucene_solr_4_1/ is open for business!
 
 I'm going to change version strings in branch_4x from 4.1 to 4.2 now.

Done.




[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553044#comment-13553044
 ] 

Commit Tag Bot commented on LUCENE-4676:


[trunk commit] Simon Willnauer
http://svn.apache.org/viewvc?view=revision&revision=1433079

LUCENE-4676: Use NoMergePolicy in starvation test to prevent buffered deletes 
pruning


 IndexReader.isCurrent race
 --

 Key: LUCENE-4676
 URL: https://issues.apache.org/jira/browse/LUCENE-4676
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir
Assignee: Simon Willnauer
 Fix For: 4.1

 Attachments: LUCENE-4676.patch


 Revision: 1431169
 ant test  -Dtestcase=TestNRTManager 
 -Dtests.method=testThreadStarvationNoDeleteNRTReader 
 -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA 
 -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII 
 -Dtests.dups=500




[jira] [Commented] (SOLR-2592) Custom Hashing

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553058#comment-13553058
 ] 

Commit Tag Bot commented on SOLR-2592:
--

[trunk commit] Yonik Seeley
http://svn.apache.org/viewvc?view=revision&revision=1433082

SOLR-2592: changes entry for doc routing


 Custom Hashing
 --

 Key: SOLR-2592
 URL: https://issues.apache.org/jira/browse/SOLR-2592
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Affects Versions: 4.0-ALPHA
Reporter: Noble Paul
Assignee: Yonik Seeley
 Fix For: 4.1

 Attachments: dbq_fix.patch, pluggable_sharding.patch, 
 pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, 
 SOLR-2592_collectionProperties.patch, SOLR-2592.patch, 
 SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, 
 SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, 
 SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch


 If the data in a cloud can be partitioned on some criteria (say range, hash, 
 attribute value etc) It will be easy to narrow down the search to a smaller 
 subset of shards and in effect can achieve more efficient search.  
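Hash-based routing of the kind described can be illustrated stand-alone (not Solr's actual router, which has its own hash and range encoding): hash the doc id and map the full 32-bit hash space onto N equal shard ranges.

```java
public class HashRouter {
    // Map a doc id to one of numShards contiguous hash ranges.
    public static int shardFor(String docId, int numShards) {
        int hash = docId.hashCode(); // stand-in for Solr's own hash function
        // shift the signed 32-bit hash space [MIN_VALUE, MAX_VALUE] to [0, 2^32)
        long unsigned = hash - (long) Integer.MIN_VALUE;
        long range = (1L << 32) / numShards;     // width of each shard's slice
        int shard = (int) (unsigned / range);
        return Math.min(shard, numShards - 1);   // guard the top edge for non-divisors
    }

    public static void main(String[] args) {
        System.out.println(shardFor("doc-42", 4));
    }
}
```

Because the assignment depends only on the id's hash, a query that knows the id (e.g. real-time get) can be sent straight to the owning shard instead of all of them.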




[jira] [Resolved] (SOLR-2592) Custom Hashing

2013-01-14 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-2592.


   Resolution: Fixed
Fix Version/s: 5.0

 Custom Hashing
 --

 Key: SOLR-2592
 URL: https://issues.apache.org/jira/browse/SOLR-2592
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Affects Versions: 4.0-ALPHA
Reporter: Noble Paul
Assignee: Yonik Seeley
 Fix For: 4.1, 5.0

 Attachments: dbq_fix.patch, pluggable_sharding.patch, 
 pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, 
 SOLR-2592_collectionProperties.patch, SOLR-2592.patch, 
 SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, 
 SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, 
 SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch


 If the data in a cloud can be partitioned on some criteria (say range, hash, 
 attribute value etc) It will be easy to narrow down the search to a smaller 
 subset of shards and in effect can achieve more efficient search.  




[jira] [Updated] (SOLR-2894) Implement distributed pivot faceting

2013-01-14 Thread Chris Russell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Russell updated SOLR-2894:


Attachment: SOLR-2894.patch

Corrected null aggregation issues when docs contain null values for the fields 
being pivoted on. Added logic to remove local params from pivot QS vars when 
determining over-request.

 Implement distributed pivot faceting
 

 Key: SOLR-2894
 URL: https://issues.apache.org/jira/browse/SOLR-2894
 Project: Solr
  Issue Type: Improvement
Reporter: Erik Hatcher
 Fix For: 4.2, 5.0

 Attachments: SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, 
 SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894-reworked.patch


 Following up on SOLR-792, pivot faceting currently only supports 
 undistributed mode.  Distributed pivot faceting needs to be implemented.




[jira] [Commented] (SOLR-3858) Doc-to-shard assignment based on range property on shards

2013-01-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553064#comment-13553064
 ] 

Yonik Seeley commented on SOLR-3858:


SOLR-3755 took care of most of this, but the shard splitting code still needs 
to use the collection specific doc router.

 Doc-to-shard assignment based on range property on shards
 ---

 Key: SOLR-3858
 URL: https://issues.apache.org/jira/browse/SOLR-3858
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley

 Anything that maps a document id to a shard should consult the ranges defined 
 on the shards (currently indexing and real-time get).




[jira] [Commented] (SOLR-2592) Custom Hashing

2013-01-14 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553066#comment-13553066
 ] 

Commit Tag Bot commented on SOLR-2592:
--

[branch_4x commit] Yonik Seeley
http://svn.apache.org/viewvc?view=revision&revision=1433084

SOLR-2592: changes entry for doc routing


 Custom Hashing
 --

 Key: SOLR-2592
 URL: https://issues.apache.org/jira/browse/SOLR-2592
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Affects Versions: 4.0-ALPHA
Reporter: Noble Paul
Assignee: Yonik Seeley
 Fix For: 4.1, 5.0

 Attachments: dbq_fix.patch, pluggable_sharding.patch, 
 pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, 
 SOLR-2592_collectionProperties.patch, SOLR-2592.patch, 
 SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, 
 SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, 
 SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch


 If the data in a cloud can be partitioned on some criteria (say range, hash, 
 attribute value etc) It will be easy to narrow down the search to a smaller 
 subset of shards and in effect can achieve more efficient search.  



