[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552497#comment-13552497 ] Dawid Weiss commented on LUCENE-4682: - Yeah, there are many ideas layered on top of each other and it's gotten beyond the point of being easy to comprehend. As for the NEXT bit -- in any implementation I've seen this leads to significant reduction in automaton size. But I'm not saying it's the optimal way to do it, perhaps there are other encoding options that would reach similar compression levels without the added complexity. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that the number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (pack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
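[Editor's note] To make the "don't expand when it wastes too much" idea above concrete, here is a rough Java sketch. None of these names come from the actual FST code; the parameters and the threshold are purely illustrative.

    // Hypothetical sketch of the "skip the array expansion when padding is too wasteful" heuristic.
    class ArcArrayHeuristic {
      /**
       * @param numArcs        number of outgoing arcs of the node
       * @param maxBytesPerArc widest single-arc encoding for this node
       * @param totalArcBytes  sum of the variable-width encodings of all arcs
       * @param maxWasteRatio  e.g. 0.25 to tolerate up to 25% padding
       */
      static boolean shouldExpandToArray(int numArcs, int maxBytesPerArc,
                                         int totalArcBytes, double maxWasteRatio) {
        int arrayBytes = numArcs * maxBytesPerArc;    // every slot padded to the widest arc
        int wastedBytes = arrayBytes - totalArcBytes; // padding forced by the outlier arc(s)
        return wastedBytes <= (int) (maxWasteRatio * arrayBytes);
      }
    }

The same wasted-bytes count could also be accumulated during the first pass and used to pick the threshold for the second (packing) pass, as the issue suggests.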
[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552498#comment-13552498 ] Dawid Weiss commented on LUCENE-3298: - The impact will show on 32-bit systems, I'm pretty sure of that. We don't care about hardware archaeology, do we? :) +1. FST has hard limit max size of 2.1 GB - Key: LUCENE-3298 URL: https://issues.apache.org/jira/browse/LUCENE-3298 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt. We could switch this to a paged byte[] and make the FST far larger. But I think this is low priority... I'm not going to work on it any time soon. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
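[Editor's note] The paged byte[] idea mentioned in the issue boils down to addressing bytes by a long and splitting that position into a block index plus an offset. A minimal sketch, with block size and all names made up for illustration (this is not Lucene's actual implementation):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative paged byte store addressed by long, so total size is not capped at Integer.MAX_VALUE.
    class PagedByteStore {
      private static final int BLOCK_BITS = 15;             // 32 KB blocks, for illustration
      private static final int BLOCK_SIZE = 1 << BLOCK_BITS;
      private static final int BLOCK_MASK = BLOCK_SIZE - 1;
      private final List<byte[]> blocks = new ArrayList<byte[]>();

      void writeByte(long pos, byte b) {
        int block = (int) (pos >>> BLOCK_BITS);
        while (blocks.size() <= block) {
          blocks.add(new byte[BLOCK_SIZE]);                  // grow lazily, one page at a time
        }
        blocks.get(block)[(int) (pos & BLOCK_MASK)] = b;
      }

      byte readByte(long pos) {
        return blocks.get((int) (pos >>> BLOCK_BITS))[(int) (pos & BLOCK_MASK)];
      }
    }

Internal references would also have to become vLong instead of vInt for this to lift the 2.1 GB limit end to end.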
[jira] [Updated] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-4683: --- Attachment: LUCENE-4683.patch * Added setNextReader to CategoryListIterator (instead of init()) and Aggregator. * Modified StandardFacetsAccumulator to iterate over the segments' atomic readers and call setNextReader accordingly. * Fixed an issue in ScoredDocIdsUtils where it assumed ScoredDocIDs are OpenBitSet, whereas for a long time they have been FixedBitSet. This caused an unnecessary copy from FixedBitSet to OpenBitSet. * Most of the other changes are API changes, i.e. createCategoryListIterator no longer takes an IndexReader etc. I haven't added a CHANGES line yet because I'm not sure whether this will make it into 4.1. Basically it's ready to go in (all tests pass), so I'll check later today what the status of the 4.1 branch is and decide accordingly. This now makes the cutover to DocValues even easier. That's what I'd like to do next. Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
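[Editor's note] A rough sketch of the per-segment flow described above. CategoryListIterator and Aggregator are the facet module's types and setNextReader(AtomicReaderContext) is the method added by this patch; the wrapping class, method name, and the empty doc loop body are assumptions for illustration only.

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.IndexReader;

    class PerSegmentAggregationSketch {
      void aggregate(IndexReader reader, CategoryListIterator cli, Aggregator aggregator)
          throws IOException {
        for (AtomicReaderContext context : reader.leaves()) {
          cli.setNextReader(context);          // position the iterator on this segment
          aggregator.setNextReader(context);   // and the aggregator too
          int maxDoc = context.reader().maxDoc();
          for (int doc = 0; doc < maxDoc; doc++) {
            // doc is now a per-segment doc ID; read its category ordinals and aggregate them here
          }
        }
      }
    }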
[jira] [Created] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
Shahar Davidson created SOLR-4302: - Summary: Improve CoreAdmin STATUS request response time by allowing user to omit the Index info Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Priority: Minor In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of the time is spent retrieving the index-related info. The suggested improvement allows the user to specify a parameter (indexInfo) which, if set to 'false', causes the index-related info (such as segmentCount, sizeInBytes, numDocs, etc.) not to be retrieved. By default, indexInfo is 'true' (to maintain the existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
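[Editor's note] As a usage illustration (host, port and the default admin path are assumptions, not part of the patch), such a request would look like http://localhost:8983/solr/admin/cores?action=STATUS&indexInfo=false and would return the same per-core status entries as a normal STATUS call, minus the index block.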
[jira] [Updated] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shahar Davidson updated SOLR-4302: -- Attachment: SOLR-4302.patch SOLR-4302, apply over trunk 1404975 Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552544#comment-13552544 ] Shahar Davidson edited comment on SOLR-4302 at 1/14/13 10:00 AM: - Attached suggested patch SOLR-4302.patch. apply over trunk 1404975 was (Author: shahar.davidson): SOLR-4302, apply over trunk 1404975 Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: looking for package org.apache.lucene.analysis.standard
Thanks to everyone, I feel I'm getting somewhere, but not quite there yet. I currently have the below in my pom. When I change my import to: import org.apache.lucene.queryparser.classic.QueryParser; Eclipse says it can't find org.apache.lucene.queryparser; however, the maven installer has no such issue. The maven installer does, however, have an issue with this line: Analyzer analyzer = new StandardAnalyzer(); It says: cannot find symbol symbol : constructor StandardAnalyzer() location: class org.apache.lucene.analysis.standard.StandardAnalyzer Even though I have the import: import org.apache.lucene.analysis.standard.StandardAnalyzer; which Eclipse has no issue with. I've cleaned my project and restarted Eclipse with no improvement to the differences shown by Eclipse and Maven. Any help much appreciated! Pom dependencies:
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>4.0.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>4.0.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>4.0.0</version>
  <scope>provided</scope>
</dependency>
-- View this message in context: http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033104.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
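[Editor's note] The compile error is most likely not a Maven problem: in Lucene 4.0 StandardAnalyzer has no no-argument constructor, it takes a Version. A sketch that should compile against the 4.0.0 artifacts listed above (the field name and query text are made up):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.util.Version;

    public class AnalyzerExample {
      public static void main(String[] args) throws Exception {
        // In 4.0 the analyzer constructor takes the compatibility version explicitly.
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        // QueryParser lives in the lucene-queryparser artifact (classic package) in 4.0.
        QueryParser parser = new QueryParser(Version.LUCENE_40, "contents", analyzer);
        System.out.println(parser.parse("hello world"));
      }
    }

Eclipse showing different errors than the Maven build usually just means the Eclipse classpath is stale relative to the pom.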
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552573#comment-13552573 ] Michael McCandless commented on LUCENE-4620: This change seemed to lose a bit of performance: look at 1/11/2013 on http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html But, that tests just one dimension (Date), with only 3 ords per doc, so I had assumed that this just wasn't enough ints being decoded to see the gains from this bulk decoding. So, I modified luceneutil to have more facets per doc (avg ~25 ords per doc across 9 dimensions; 2.5M unique ords), and the results are still slower:
{noformat}
    Task    QPS base   StdDev    QPS comp   StdDev    Pct diff
HighTerm        3.62   (2.5%)        3.24   (1.0%)    -10.5% ( -13% -  -7%)
 MedTerm        7.34   (1.7%)        6.78   (0.9%)     -7.6% ( -10% -  -5%)
 LowTerm       14.92   (1.6%)       14.32   (1.2%)     -4.0% (  -6% -  -1%)
PKLookup      181.47   (4.7%)      183.04   (5.3%)      0.9% (  -8% -  11%)
{noformat}
This is baffling ... not sure what's up. I would expect some gains given that the micro-benchmark showed sizable decode improvements. It must somehow be that decode cost is a minor part of facet counting? (which is not a good sign! it should be a big part of it...) Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer reassigned LUCENE-4676: --- Assignee: Simon Willnauer IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552594#comment-13552594 ] Shai Erera commented on LUCENE-4620: I'm baffled too. There is some overhead with the bulk API, in that it needs to {{grow()}} the {{IntsBuffer}} (something it didn't need to do before). But I believe that this growing should stabilize after a few docs (i.e. the array becomes large enough). Still, every iteration checks if the array is large enough, so perhaps if we grow the IntsRef upfront (even if too much), we can remove the 'ifs'. SimpleIntDecoder can do it easily: it knows there are 4 bytes per value, so it should just grow by buf.length / 4. VInt is more tricky, but to be on the safe side it can grow by buf.length, as at the minimum each value occupies only one byte. Some other decoders are trickier, but they are not in effect in your test above. But I must admit that I thought it was a no-brainer that replacing an iterator API with a bulk one would improve performance. And indeed, {{EncodingSpeed}} shows nice improvements already. And even if decoding values is not the major part of faceted search (which I doubt), we shouldn't see slowdowns; at most we just shouldn't see big wins? Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
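[Editor's note] A sketch of the "grow once up front, drop the per-value capacity checks" idea described above. This is not the facet module's actual decoder; the byte layout shown (high bit means "another byte follows") and the class/method names are illustrative assumptions.

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRef;

    class BulkVIntDecodeSketch {
      static void decode(BytesRef buf, IntsRef values) {
        values.offset = 0;
        values.length = 0;
        // Worst case: one int per byte, so a single grow() covers the whole block and
        // the "is the array big enough?" check disappears from the decode loop.
        if (values.ints.length < buf.length) {
          values.grow(buf.length);
        }
        int upto = buf.offset;
        final int end = buf.offset + buf.length;
        while (upto < end) {
          int value = 0;
          int b;
          do {
            b = buf.bytes[upto++] & 0xFF;
            value = (value << 7) | (b & 0x7F);   // accumulate 7 payload bits per byte
          } while ((b & 0x80) != 0);             // high bit set: more bytes follow
          values.ints[values.length++] = value;
        }
      }
    }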
[jira] [Comment Edited] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552594#comment-13552594 ] Shai Erera edited comment on LUCENE-4620 at 1/14/13 11:51 AM: -- I'm baffled too. There is some overhead with the bulk API, in that it needs to {{grow()}} the {{IntsRef}} (something it didn't need to do before). But I believe that this growing should stabilize after few docs (i.e. the array becomes large enough). Still, every iteration checks if the array is large enough, so perhaps if we grow the IntsRef upfront (even if too much), we can remove the 'ifs'. SimpleIntDecoder can do it easily, it knows there are 4 bytes per value, so it should just grow by buf.length / 4. VInt is more tricky, but to be on the safe side it can grow by buf.length, as at the minimum each value occupies only one byte. Some other decoders are trickier, but they are not in effect in your test above. But I must admit that I thought it's a no brainer that replacing an iterator API by a bulk is going to improve performance. And indeed, {{EncodingSpeed}} shows nice improvements already. And even if decoding values is not the major part of faceted search (which I doubt), we shouldn't see slowdowns? At the most we shouldn't see big wins? was (Author: shaie): I'm baffled too. There is some overhead with the bulk API, in that it needs to {{grow()}} the {{IntsBuffer}} (something it didn't need to do before). But I believe that this growing should stabilize after few docs (i.e. the array becomes large enough). Still, every iteration checks if the array is large enough, so perhaps if we grow the IntsRef upfront (even if too much), we can remove the 'ifs'. SimpleIntDecoder can do it easily, it knows there are 4 bytes per value, so it should just grow by buf.length / 4. VInt is more tricky, but to be on the safe side it can grow by buf.length, as at the minimum each value occupies only one byte. Some other decoders are trickier, but they are not in effect in your test above. But I must admit that I thought it's a no brainer that replacing an iterator API by a bulk is going to improve performance. And indeed, {{EncodingSpeed}} shows nice improvements already. And even if decoding values is not the major part of faceted search (which I doubt), we shouldn't see slowdowns? At the most we shouldn't see big wins? Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. 
At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552599#comment-13552599 ] Commit Tag Bot commented on LUCENE-4683: [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revisionrevision=1432890 LUCENE-4683: Change Aggregator and CategoryListIterator to be per-segment Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-4683. Resolution: Fixed Fix Version/s: 5.0 4.1 I ran tests few times and all was quiet. Committed to trunk and 4x (add CHANGES too). Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4683) Change Aggregator and CategoryListIterator to be per-segment
[ https://issues.apache.org/jira/browse/LUCENE-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552601#comment-13552601 ] Commit Tag Bot commented on LUCENE-4683: [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revisionrevision=1432894 LUCENE-4683: Change Aggregator and CategoryListIterator to be per-segment Change Aggregator and CategoryListIterator to be per-segment Key: LUCENE-4683 URL: https://issues.apache.org/jira/browse/LUCENE-4683 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4683.patch As another improvement, these two (mostly CategoryListIterator) should be per-segment. I've got a patch nearly ready, will post tomorrow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4321) java.io.FilterReader considered harmful
[ https://issues.apache.org/jira/browse/LUCENE-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Lukanin updated LUCENE-4321: -- Attachment: NoRandomReadMockTokenizer.java I had to extend MockTokenizer, because I read the buffer completely to decide what to do with the input (whether or not to convert it to something else). When you use different reading methods randomly, my tests don't pass. If the same method (it may be any of them) were used for the complete input string, they would pass, but now the output string is messed up, because some parts of the input are converted and some are not. java.io.FilterReader considered harmful --- Key: LUCENE-4321 URL: https://issues.apache.org/jira/browse/LUCENE-4321 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0-BETA Reporter: Robert Muir Fix For: 4.0, 5.0 Attachments: LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, NoRandomReadMockTokenizer.java See Dawid's email: http://find.searchhub.org/document/64b0a28c53faf39 Reader.java is fine, it has lots of methods like read(), read(char[]), read(CharBuffer), skip(), but these all have default implementations delegating to read(char[], int, int). Unfortunately FilterReader delegates too many unnecessary things such as read() and skip() in a broken way. It should have just left these alone. This can cause traps for someone upgrading because they have to override multiple methods, when read(char[], int, int) should be enough, and all Reader methods will then work correctly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar reassigned SOLR-4302: --- Assignee: Shalin Shekhar Mangar Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552612#comment-13552612 ] Shai Erera commented on LUCENE-4620: I made this change to VInt8IntDecoder instead of checking inside the loop:
{code}
int numValues = buf.length; // a value occupies at least 1 byte
if (values.ints.length < numValues) {
  values.grow(numValues);
}
{code}
Ran EncodingSpeed again and compared the results. On average (4 datasets), VInt8 achieves a 0.69% speedup, DGap(VInt) 7.85% and Sorting(Unique(DGap(VInt))) 10.16%. The last one is the default Encoder, though its decoder is only DGap(VInt), so I'm not sure why there is a difference between that run and the previous one with 7.85%. However, it does look like it speeds things up... Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4321) java.io.FilterReader considered harmful
[ https://issues.apache.org/jira/browse/LUCENE-4321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552613#comment-13552613 ] Robert Muir commented on LUCENE-4321: - Your charfilter is broken. java.io.FilterReader considered harmful --- Key: LUCENE-4321 URL: https://issues.apache.org/jira/browse/LUCENE-4321 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.0-BETA Reporter: Robert Muir Fix For: 4.0, 5.0 Attachments: LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, LUCENE-4321.patch, NoRandomReadMockTokenizer.java See Dawid's email: http://find.searchhub.org/document/64b0a28c53faf39 Reader.java is fine, it has lots of methods like read(), read(char[]), read(CharBuffer), skip(), but these all have default implementations delegating to read(char[], int, int). Unfortunately FilterReader delegates too many unnecessary things such as read() and skip() in a broken way. It should have just left these alone. This can cause traps for someone upgrading because they have to override multiple methods, when read(char[], int, int) should be enough, and all Reader methods will then work correctly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552614#comment-13552614 ] Michael McCandless commented on LUCENE-4682: I tried removing NEXT opto in building the all-English-Wikipedia-terms FST and it was a big hit: * With NEXT: 59267794 bytes * Without NEXT: 82543993 bytes So FST would be ~39% larger if we remove NEXT ... however lookup sped up from 726 ns per lookup to 636 ns. But: we could get this speedup today, if we just fixed setting of a NEXT arc's target to be lazy instead. Today it's very costly for non-array arcs because we scan to the end of all nodes to set the target, even if the caller isn't going to use it, which is really ridiculous. I also tested delta-coding the arc target instead of the abs vInt we have today ... it wasn't a real test, instead I just measured how many bytes the vInt delta would be vs how many bytes the vInt abs it today, and the results were disappointing: * Abs vInt (what we do today): 26681349 bytes * Delta vInt: 25479316 bytes Which is surprising ... I guess we don't see much locality for the nodes ... or, eg the common suffixes freeze early on and then lots of future nodes refer to them. Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
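[Editor's note] The delta-vs-absolute comparison above boils down to summing vInt widths for each arc target. A tiny Java sketch of that measurement (names are invented; this is not the FST writer itself, and 'address'/'target' are hypothetical positions in the FST byte store):

    class TargetCodingCost {
      static int vIntLength(long v) {
        int bytes = 1;
        while (v >= 0x80) {   // 7 payload bits per byte
          v >>>= 7;
          bytes++;
        }
        return bytes;
      }

      static int savingsFromDelta(long address, long target) {
        int absBytes = vIntLength(target);                       // what is written today
        int deltaBytes = vIntLength(Math.abs(address - target)); // delta from the write position
        return absBytes - deltaBytes;   // positive means delta-coding would save bytes for this arc
      }
    }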
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552619#comment-13552619 ] Robert Muir commented on LUCENE-4682: - {quote} So FST would be ~39% larger if we remove NEXT {quote} But according to your notes above, we have 28% waste for this (with a long output). Are we making the right tradeoff? {quote} Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? {quote} Or, not do it at all if we cant figure it out? The reversing holds back other improvements so benchmarking it by itself could be misleading. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552621#comment-13552621 ] Commit Tag Bot commented on SOLR-4302: -- [trunk commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revisionrevision=1432901 SOLR-4302: New parameter 'indexInfo' (defaults to true) in CoreAdmin STATUS command can be used to omit index specific information Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552624#comment-13552624 ] Dawid Weiss commented on LUCENE-4682: - bq. I also tested delta-coding the arc target instead of the abs vInt we have today ... I did such experiments when I was working on that paper. Remember -- you don't publish negative results, unfortunately. Obviously I didn't try everything but: 1) NEXT was important, 2) the structure of the FST doesn't yield to easy local deltas; it's not easily separable and pointers typically jump all over. bq. Which is surprising ... I guess we don't see much locality for the nodes ... or, eg the common suffixes freeze early on and then lots of future nodes refer to them. Not really that surprising -- you encode common suffixes after all so most of them will appear in a properly sized sample. This is actually why the strategy of moving nodes around works too -- you move those that are super frequent but they'll most likely be reordered at the top suffix frequencies of the automaton anyway. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar resolved SOLR-4302. - Resolution: Fixed Fix Version/s: 5.0 4.1 Committed to trunk and branch_4x. Thanks Shahar! Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Fix For: 4.1, 5.0 Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4302) Improve CoreAdmin STATUS request response time by allowing user to omit the Index info
[ https://issues.apache.org/jira/browse/SOLR-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552628#comment-13552628 ] Commit Tag Bot commented on SOLR-4302: -- [branch_4x commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revisionrevision=1432903 SOLR-4302: New parameter 'indexInfo' (defaults to true) in CoreAdmin STATUS command can be used to omit index specific information Improve CoreAdmin STATUS request response time by allowing user to omit the Index info -- Key: SOLR-4302 URL: https://issues.apache.org/jira/browse/SOLR-4302 Project: Solr Issue Type: Improvement Components: multicore Affects Versions: 4.0, 4.1, 5.0 Reporter: Shahar Davidson Assignee: Shalin Shekhar Mangar Priority: Minor Labels: performance Fix For: 4.1, 5.0 Attachments: SOLR-4302.patch In large multicore environments (hundreds+ of cores), the STATUS request may take a fair amount of time. It seems that the majority of time is spent retrieving the index related info. The suggested improvement allows the user to specify a parameter (indexInfo) that if 'false' then index related info (such as segmentCount, sizeInBytes, numDocs, etc.) will not be retrieved. By default, the indexInfo will be 'true' (to maintain existing STATUS request behavior). For example, when tested on a given machine with 380+ solr cores, the full STATUS request took 800ms-900ms, whereas using indexInfo=false returned results in about 1ms-4ms. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552633#comment-13552633 ] Michael McCandless commented on LUCENE-4682: {quote} bq. So FST would be ~39% larger if we remove NEXT But according to your notes above, we have 28% waste for this (with a long output). Are we making the right tradeoff? {quote} Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today. {quote} bq. Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? Or, not do it at all if we cant figure it out? The reversing holds back other improvements so benchmarking it by itself could be misleading. {quote} I don't think we should drop NEXT unless we have some alternative? A 39% increase in size is non-trivial! I know reversing held back delta-coding of the node target, but that looks like it won't gain much. What else is it holding back? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (pack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552635#comment-13552635 ] Uwe Schindler commented on LUCENE-4570: --- I started a google code project: http://code.google.com/p/forbidden-apis/ This is a fork with many new additions: - auto-generated deprecated signature list (from rt.jar) - support for bundled and project-maintained signature lists (like the deprecated ones for various JDK versions, the well known charset/locale/... violators) - no direct ASM 4.1 dependency conflicting with other dependencies: The ASM library is jarjar'ed into the artifact - _not yet:_ Comments for every signature thats printed in error message - _not yet:_ Mäven support (Mojo) Once there is a release (hopefully soon) release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552635#comment-13552635 ] Uwe Schindler edited comment on LUCENE-4570 at 1/14/13 1:05 PM: I started a google code project: http://code.google.com/p/forbidden-apis/ This is a fork with many new additions: - auto-generated deprecated signature list (from rt.jar) - support for bundled and project-maintained signature lists (like the deprecated ones for various JDK versions, the well known charset/locale/... violators) - no direct ASM 4.1 dependency conflicting with other dependencies: The ASM library is jarjar'ed into the artifact - _not yet:_ Comments for every signature thats printed in error message - _not yet:_ Mäven support (Mojo) - Selckin already started a fork in Github, but as the new project is almost a complete rewrite of the API (decouple ANT task from logic), I will need his help - _not yet:_ Mäven Release, so IVY can download it Once there is a release (hopefully soon), this can ivy:cachepath'ed and taskdef'ed into the Lucene build was (Author: thetaphi): I started a google code project: http://code.google.com/p/forbidden-apis/ This is a fork with many new additions: - auto-generated deprecated signature list (from rt.jar) - support for bundled and project-maintained signature lists (like the deprecated ones for various JDK versions, the well known charset/locale/... violators) - no direct ASM 4.1 dependency conflicting with other dependencies: The ASM library is jarjar'ed into the artifact - _not yet:_ Comments for every signature thats printed in error message - _not yet:_ Mäven support (Mojo) Once there is a release (hopefully soon) release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552641#comment-13552641 ] Robert Muir commented on LUCENE-4682: - {quote} Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today. {quote} {quote} I know reversing held back delta-code of the node target, but, that looks like it won't gain much. What else is it holding back? {quote} I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers :) Its a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway... It would be really really really good for FSTs long term to do things like remove reversing, remove packed (fold these optos or at least most of them in by default), etc. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552644#comment-13552644 ] Dawid Weiss commented on LUCENE-4570: - Nice! release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: looking for package org.apache.lucene.analysis.standard
Hi Jim, Try getting rid of the <scope>provided</scope> lines. Steve On Jan 14, 2013 5:38 AM, JimAld jim.alder...@db.com wrote: Thanks to everyone, I feel I'm getting somewhere, but not quite there yet. I currently have the below in my pom. When I change my import to: import org.apache.lucene.queryparser.classic.QueryParser; Eclipse says it can't find org.apache.lucene.queryparser; however, the Maven installer has no such issue. The Maven installer does, however, have an issue with this line: Analyzer analyzer = new StandardAnalyzer(); It says: cannot find symbol symbol : constructor StandardAnalyzer() location: class org.apache.lucene.analysis.standard.StandardAnalyzer Even though I have the import: import org.apache.lucene.analysis.standard.StandardAnalyzer; which Eclipse has no issue with. I've cleaned my project and restarted Eclipse with no improvement to the differences shown by Eclipse and Maven. Any help much appreciated! Pom dependencies: <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>4.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-analyzers-common</artifactId> <version>4.0.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-queryparser</artifactId> <version>4.0.0</version> <scope>provided</scope> </dependency> -- View this message in context: http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033104.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552653#comment-13552653 ] Michael McCandless commented on LUCENE-4682: bq. I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers LOL :) But yeah I agree. bq. Its a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway... Right but it's not good if bus factor is 1 ... it's effectively dead code when that happens. bq. It would be really really really good for FSTs long term to do things like remove reversing, remove packed (fold these optos or at least most of them in by default), etc. +1, except that NEXT buys us a much smaller FST now. We can't just drop it ... we need some way to simplify it (eg somehow stop reversing). Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-4620: --- Attachment: LUCENE-4620.patch Maybe doing bulk-vInt-decode (see patch) will be faster (just make hotspot's job easier) ... I'll test. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
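For readers following along, the bulk API shape proposed in the issue description would look roughly like the sketch below. The class names are placeholders, not the actual facet-module classes; only the encode(IntsRef, BytesRef) / decode(BytesRef, IntsRef) signatures come from the description above.
{code}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

abstract class BulkIntsEncoder {
  // Bulk encode: append the first values.length ints of 'values' onto 'buf'.
  abstract void encode(IntsRef values, BytesRef buf);
}

abstract class BulkIntsDecoder {
  // Bulk decode: read every int encoded in 'buf' into 'values', growing it as needed.
  abstract void decode(BytesRef buf, IntsRef values);
}
{code}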
[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552656#comment-13552656 ] Michael McCandless commented on LUCENE-3298: bq. The impact will show on 32-bit systems I'm pretty sure of that. Yeah I think it will too ... bq. We don't care about hardware archaeology, do we? I think Lucene should continue to run on 32 bit hardware, but I don't think performance on 32 bit is important, ie we should optimize for 64 bit performance. FST has hard limit max size of 2.1 GB - Key: LUCENE-3298 URL: https://issues.apache.org/jira/browse/LUCENE-3298 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt. We could switch this to a paged byte[] and make the far larger. But I think this is low priority... I'm not going to work on it any time soon. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-4570) release policeman tools?
[ https://issues.apache.org/jira/browse/LUCENE-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-4570: - Assignee: Uwe Schindler release policeman tools? Key: LUCENE-4570 URL: https://issues.apache.org/jira/browse/LUCENE-4570 Project: Lucene - Core Issue Type: New Feature Reporter: Robert Muir Assignee: Uwe Schindler Currently there is source code in lucene/tools/src (e.g. Forbidden APIs checker ant task). It would be convenient if you could download this thing in your ant build from ivy (especially if maybe it included our definitions .txt files as resources). In general checking for locale/charset violations in this way is a pretty general useful thing for a server-side app. Can we either release lucene-tools.jar as an artifact, or maybe alternatively move this somewhere else as a standalone project and suck it in ourselves? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552668#comment-13552668 ] Shai Erera commented on LUCENE-4620: I see. I have two comments about the patch. This part is wrong: {code} +int needed = upto - buf.offset; +if (values.length < needed) { + values.grow(needed); +} {code} should be {code} +if (values.ints.length < buf.length) { + values.grow(buf.length); +} {code} Does it even run for you? because {{values.length = 0}} at start. Also, note how this way you check offset < upto on every byte read while in the current code it's checked only once per integer read. Maybe if you do a while loop inside the loop, something like {{while (b < 0)}}. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552668#comment-13552668 ] Shai Erera edited comment on LUCENE-4620 at 1/14/13 2:10 PM: - I see. I have two comments about the patch. This part is wrong: {code} +int needed = upto - buf.offset; +if (values.length < needed) { + values.grow(needed); +} {code} should be {code} +if (values.ints.length < buf.length) { + values.grow(buf.length); +} {code} With your patch, values.grow() is always called, even if inside it doesn't do anything. I wonder if we should not {{grow()}} the array, but rather grow it from the outside ourselves. Because IntsRef.grow() checks the capacity again (and Robert is against grow() anyway...). Also, note how this way you check offset < upto on every byte read while in the current code it's checked only once per integer read. Maybe if you do a while loop inside the loop, something like {{while (b < 0)}}. was (Author: shaie): I see. I have two comments about the patch. This part is wrong: {code} +int needed = upto - buf.offset; +if (values.length < needed) { + values.grow(needed); +} {code} should be {code} +if (values.ints.length < buf.length) { + values.grow(buf.length); +} {code} Does it even run for you? because {{values.length = 0}} at start. Also, note how this way you check offset < upto on every byte read while in the current code it's checked only once per integer read. Maybe if you do a while loop inside the loop, something like {{while (b < 0)}}. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
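Putting Shai's comments together, the decode loop under discussion would look roughly like the sketch below. This is a reconstruction for illustration, not the attached patch: capacity is grown once up front against buf.length (an upper bound on the number of encoded ints, since each vInt takes at least one byte), the continuation bit is consumed in an inner while (b < 0) loop, and the offset < upto bound is tested once per integer rather than once per byte.
{code}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

final class BulkVIntDecode {
  // Decodes the vInt-encoded values in buf.bytes[buf.offset .. buf.offset+buf.length)
  // into values.ints, setting values.length to the number of ints read.
  static void decode(BytesRef buf, IntsRef values) {
    if (values.ints.length < buf.length) {
      values.grow(buf.length);       // grow once: buf.length >= number of encoded ints
    }
    values.length = 0;
    int offset = buf.offset;
    final int upto = buf.offset + buf.length;
    while (offset < upto) {          // checked once per integer
      byte b = buf.bytes[offset++];
      int value = b & 0x7F;
      int shift = 7;
      while (b < 0) {                // high bit set: more bytes follow
        b = buf.bytes[offset++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      }
      values.ints[values.length++] = value;
    }
  }
}
{code}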
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552677#comment-13552677 ] Simon Willnauer commented on LUCENE-4676: - after looking at IW infostreams for a while I am convinced this is a test-bug (a pretty rare one I'd say). So what happens here is the following (applyDeletes=false): {noformat} 1. Thread[1] adds a doc (D1) 2. Thread[1] pull a new reader 3. Thread[1] adds another doc (D2) 3a. Thread[2] pull a new reader 3b. Thread[2] adds a del query 3c. Thread[2] pull a new reader 4. Thread[1] checks if reader is current {noformat} (3a - 3c are concurrent) given that we don't apply deletes on a NRT reader pull we should see _isCurrent == false_ Well this works most of the time unless there is a concurrent merge kicked off right after doc was added in _3_ that sees both flushed segments (D1 and D2) and subsequently tries to apply deletes to those segments. Here comes the problem, if the applyDeletes is fast enough (ie. reaches BufferedDeletesStream#prune()) before _4_ it drops the delete query from the streams (correct behavior!) but doesn't checkpoint since no segment was affected. If we check isCurrent now we see a _true_ value since the BufferedDeletesStream is empty (pruned) and the merge didn't finish yet (no checkpoint) which means the version of the SegmentInfos is the same. does this make sense? I switched over to NoMergePolicy on this test and tests pass all the time (500k times executed) while with a real MP it fails very quickly for me. IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552680#comment-13552680 ] Martijn van Groningen commented on LUCENE-3931: --- This makes sense to me. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( "l", "m", "t", "qu", "n", "s", "j", "d"), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen reassigned LUCENE-3931: - Assignee: Martijn van Groningen Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-4016: Attachment: SOLR-4016-disallow-partial-update.patch Patch which disallows partial updates on signature generating fields Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552686#comment-13552686 ] Tommaso Teofili commented on LUCENE-3931: - that's true for Italian as well. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552687#comment-13552687 ] Steve Rowe commented on LUCENE-3931: Because ElisionFilter use is used by more than just French, the set of contractions was moved out of ElisionFilter (LUCENE-3884). The issue of missing French contractions has already been addressed, in LUCENE-4662. I didn't notice this issue - I would have resolved it when I resolved LUCENE-4662. So Martijn, unless there is some other reason to keep this issue open, I think it can be resolved as a duplicate. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552692#comment-13552692 ] Steve Rowe commented on LUCENE-3931: bq. that's true for Italian as well. [ItalianAnalyzer|http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_4_0_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/it/ItalianAnalyzer.java?revision=1396952view=markup#l53] includes d in the list of contractions it gives to ElisionFilter. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
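Since LUCENE-3884 the article set is supplied to ElisionFilter by the analyzer, so a custom analysis chain that wants d handled can simply pass its own set, the way ItalianAnalyzer does. A minimal sketch, assuming the Lucene 4.x ElisionFilter(TokenStream, CharArraySet) constructor; the article list below is only an example, not an official default:
{code}
import java.util.Arrays;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.ElisionFilter;
import org.apache.lucene.util.Version;

final class CustomElision {
  // The usual French articles plus "d"; ignoreCase = true.
  static final CharArraySet ARTICLES = new CharArraySet(Version.LUCENE_40,
      Arrays.asList("l", "m", "t", "qu", "n", "s", "j", "d"), true);

  static TokenStream wrap(TokenStream in) {
    return new ElisionFilter(in, ARTICLES);
  }
}
{code}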
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552729#comment-13552729 ] Tommaso Teofili commented on LUCENE-3931: - ok, thanks for clarifying Steve. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Closed] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen closed LUCENE-3931. - Resolution: Fixed Assignee: (was: Martijn van Groningen) Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552761#comment-13552761 ] Martijn van Groningen commented on LUCENE-3931: --- I see. I'll close it. Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Assignee: Martijn van Groningen Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3931) Adding d character to default ElisionFilter
[ https://issues.apache.org/jira/browse/LUCENE-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552764#comment-13552764 ] David Pilato commented on LUCENE-3931: -- Thanks all! Adding d character to default ElisionFilter - Key: LUCENE-3931 URL: https://issues.apache.org/jira/browse/LUCENE-3931 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: David Pilato Priority: Trivial As described in Wikipedia (http://fr.wikipedia.org/wiki/%C3%89lision), the d character is used in french as an elision character. E.g.: déclaration d'espèce So, it would be useful to have it as a default elision token. {code:title=ElisionFilter.java|borderStyle=solid} private static final CharArraySet DEFAULT_ARTICLES = CharArraySet.unmodifiableSet( new CharArraySet(Version.LUCENE_CURRENT, Arrays.asList( l, m, t, qu, n, s, j, d), true)); {code} HTH David. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4602) Use DocValues to store per-doc facet ord
[ https://issues.apache.org/jira/browse/LUCENE-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-4602: --- Component/s: modules/facet Use DocValues to store per-doc facet ord Key: LUCENE-4602 URL: https://issues.apache.org/jira/browse/LUCENE-4602 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Michael McCandless Attachments: LUCENE-4602.patch, LUCENE-4602.patch Spinoff from LUCENE-4600 DocValues can be used to hold the byte[] encoding all facet ords for the document, instead of payloads. I made a hacked up approximation of in-RAM DV (see CachedCountingFacetsCollector in the patch) and the gains were somewhat surprisingly large: {noformat} TaskQPS base StdDevQPS comp StdDev Pct diff HighTerm0.53 (0.9%)1.00 (2.5%) 87.3% ( 83% - 91%) LowTerm7.59 (0.6%) 26.75 (12.9%) 252.6% ( 237% - 267%) MedTerm3.35 (0.7%) 12.71 (9.0%) 279.8% ( 268% - 291%) {noformat} I didn't think payloads were THAT slow; I think it must be the advance implementation? We need to separately test on-disk DV to make sure it's at least on-par with payloads (but hopefully faster) and if so ... we should cutover facets to using DV. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
4.1 branch
For anyone with pending patches: I plan on branching for 4.1 at around 1:00pm US EST. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present
Jack, Did you test this to see if you could trigger this bug? But in any case, can you open a jira ticket so this won't fall under the radar? Even if the comment that was put here is true I guess we should minimally throw an exception, or use the first one and log a warning, maybe? James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, January 13, 2013 1:24 PM To: Lucene/Solr Dev Subject: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present Reading through the code for Solr SpellCheckComponent.java for 4.1, it looks like it neither complains nor defaults reasonably if more than one QueryConverter class is present in the Solr lib directories: Map<String, QueryConverter> queryConverters = new HashMap<String, QueryConverter>(); core.initPlugins(queryConverters,QueryConverter.class); //ensure that there is at least one query converter defined if (queryConverters.size() == 0) { LOG.info("No queryConverter defined, using default converter"); queryConverters.put("queryConverter", new SpellingQueryConverter()); } //there should only be one if (queryConverters.size() == 1) { queryConverter = queryConverters.values().iterator().next(); IndexSchema schema = core.getSchema(); String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType"); FieldType fieldType = schema.getFieldTypes().get(fieldTypeName); Analyzer analyzer = fieldType == null ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion) : fieldType.getQueryAnalyzer(); //TODO: There's got to be a better way! Where's Spring when you need it? queryConverter.setAnalyzer(analyzer); } No else! And queryConverter is not initialized, except for that code path where there was zero or one QueryConverter class. -- Jack Krupansky
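One possible shape for the guard James suggests (warn and fall back to the first converter rather than leaving queryConverter uninitialized). This is purely illustrative and only mirrors the variable names in the snippet above; it is not the committed fix:
{code}
if (queryConverters.size() == 0) {
  LOG.info("No queryConverter defined, using default converter");
  queryConverters.put("queryConverter", new SpellingQueryConverter());
} else if (queryConverters.size() > 1) {
  // Hypothetical guard: complain loudly instead of silently skipping initialization.
  LOG.warn("Expected exactly one QueryConverter, found " + queryConverters.size()
      + "; using an arbitrary one");
}
queryConverter = queryConverters.values().iterator().next();
// ... analyzer lookup and queryConverter.setAnalyzer(analyzer) as in the existing code ...
{code}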
RE: looking for package org.apache.lucene.analysis.standard
Hi, Well I've managed to fix the issue, sort of, so thought I should summarise here for any others who stumble across this issue. The reason maven was not able to build the project is that an empty constructor for StandardAnalyzer does not exist, even though both Eclipse and jadclipse showed that it did exist when referencing the 4.0.0 library. Using the constructor that took the Version as a param fixed this issue and allowed Maven to build the project. The Eclipse package explorer now shows no errors, however the Eclipse code viewer is littered with them. I tried referencing the 3 Lucene jars (shown above) directly in the classpath and this fixed the errors Eclipse showed with regards to Lucene, however it introduced a load of new errors with the rest of the project - can't win! Anyhow, I can live without intelli-sense for this project, as long as maven builds, that's the main thing. Thanks to everyone for their replies to this post. -- View this message in context: http://lucene.472066.n3.nabble.com/looking-for-package-org-apache-lucene-analysis-standard-tp4028789p4033195.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
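For anyone landing here with the same compile error, the fix Jim describes boils down to this: StandardAnalyzer in Lucene 4.0.0 has no zero-argument constructor, so the match version must be passed explicitly. A minimal sketch:
{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

final class AnalyzerHolder {
  // Lucene 4.0.0: StandardAnalyzer() does not exist; pass the match version instead.
  static Analyzer newAnalyzer() {
    return new StandardAnalyzer(Version.LUCENE_40);
  }
}
{code}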
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552836#comment-13552836 ] Simon Willnauer commented on LUCENE-4676: - to visualize this again here is a commented Log from a failure: {panel} IW [Thread-623]: getReader took 2 msec CMS [Lucene Merge Thread #0]: merge thread: start TEST [Thread-623]: refresh after delete {color:red}== HERE WE REFRESH AFTER THE DEL BY QUERY{color} DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false IW [Thread-623]: nrtIsCurrent: infoVersion matches: true DW changes: true BD changes: true DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false IW [Thread-623]: nrtIsCurrent: infoVersion matches: true DW changes: true BD changes: true DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false IW [Thread-623]: flush at getReader DW [Thread-623]: Thread-623 startFullFlush DW [Thread-623]: anyChanges? numDocsInRam=0 deletes=true hasTickets:false pendingChangesInFullFlush: false DWFC [Thread-623]: addFlushableState DocumentsWriterPerThread [pendingDeletes=gen=0, segment=null, aborting=false, numDocsInRAM=0, deleteQueue=DWDQ: [ generation: 3 ]] DW [Thread-623]: Thread-623: flush naked frozen global deletes {color:red}== HERE WE PUSH THE DEL BY QUERY TO THE BUFFERED DELETE STREAM{color} BD [Thread-623]: push deletes 1 deleted queries bytesUsed=32 delGen=4 packetCount=2 totBytesUsed=1056 DW [Thread-623]: flush: push buffered deletes: 1 deleted queries bytesUsed=32 BD [Lucene Merge Thread #0]: applyDeletes: infos=[_1(5.0):c1, _0(5.0):C1] packetCount=2 BD [Lucene Merge Thread #0]: seg=_1(5.0):c1 segGen=3 coalesced deletes=[CoalescedDeletes(termSets=1,queries=1)] newDelCount=0 BD [Lucene Merge Thread #0]: seg=_0(5.0):C1 segGen=1 coalesced deletes=[CoalescedDeletes(termSets=2,queries=1)] newDelCount=0 BD [Lucene Merge Thread #0]: applyDeletes took 0 msec {color:red}== THE MERGE KICKS IN{color} BD [Lucene Merge Thread #0]: prune sis=org.apache.lucene.index.SegmentInfos@6dfb8d2e minGen=5 packetCount=2 BD [Lucene Merge Thread #0]: pruneDeletes: prune 2 packets; 0 packets remain {color:red}== MERGE PRUNES AWAY THE PACKAGE{color} IW [Lucene Merge Thread #0]: merge seg=_2 _1(5.0):c1 _0(5.0):C1 IW [Lucene Merge Thread #0]: now merge merge=_1(5.0):c1 _0(5.0):C1 index=_0(5.0):C1 _1(5.0):c1 IW [Lucene Merge Thread #0]: merging _1(5.0):c1 _0(5.0):C1 IW [Thread-623]: don't apply deletes now delTermCount=0 bytesUsed=0 IW [Thread-623]: return reader version=6 reader=StandardDirectoryReader(:nrt _0(5.0):C1 _1(5.0):c1) DW [Thread-623]: Thread-623 finishFullFlush success=true IW [Thread-623]: getReader took 1 msec {color:red}== HERE WE ARE DONE REFRESHING AFTER THE DELETE -- DEL QUERY IS ALREADY GONE {color} IW [Lucene Merge Thread #0]: seg=_1(5.0):c1 no deletes IW [Lucene Merge Thread #0]: seg=_0(5.0):C1 no deletes TEST 
[TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: done updating DW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false IW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: nrtIsCurrent: infoVersion matches: true DW changes: false BD changes: false {color:red}== HERE WE ARE ASSERTING ON isCurrent == FALSE and FAIL!!{color} DW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false DW [TEST-TestNRTManager.testThreadStarvationNoDeleteNRTReader-seed#[925ECD106FBFA3FF]]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false SM [Lucene Merge Thread #0]: merge store matchedCount=2 vs 2 {panel} IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1
[JENKINS] Lucene-Solr-trunk-Linux (64bit/jdk1.7.0_10) - Build # 3772 - Failure!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/3772/ Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC All tests passed Build Log: [...truncated 25890 lines...] -documentation-lint: [echo] checking for broken html... [jtidy] Checking for broken html (such as invalid tags)... [delete] Deleting directory /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/build/jtidy_tmp [echo] Checking for broken links... [exec] [exec] Crawl/parse... [exec] [exec] Verify... [exec] [exec] file:///build/docs/core/org/apache/lucene/analysis/package-summary.html [exec] BAD EXTERNAL LINK: http://lucene.apache.org/core/discussion.html [exec] [exec] Broken javadocs links were found! BUILD FAILED /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:60: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/build.xml:242: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/lucene/common-build.xml:1961: exec returned: 1 Total time: 37 minutes 5 seconds Build step 'Invoke Ant' marked build as failure Archiving artifacts Recording test results Description set: Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-4676: Attachment: LUCENE-4676.patch here is a patch to fix this test IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Attachments: LUCENE-4676.patch Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
Mark Miller created SOLR-4303: - Summary: On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552888#comment-13552888 ] Commit Tag Bot commented on SOLR-4303: -- [trunk commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revisionrevision=1432993 SOLR-4303: On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated SOLR-4303: -- Description: Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html was:Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552897#comment-13552897 ] Commit Tag Bot commented on SOLR-4303: -- [branch_4x commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revisionrevision=1432995 SOLR-4303: On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-4303) On replication, if the generation of the master is lower than the slave we need to force a full copy of the index.
[ https://issues.apache.org/jira/browse/SOLR-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved SOLR-4303. --- Resolution: Fixed On replication, if the generation of the master is lower than the slave we need to force a full copy of the index. -- Key: SOLR-4303 URL: https://issues.apache.org/jira/browse/SOLR-4303 Project: Solr Issue Type: Bug Components: replication (java) Reporter: Mark Miller Assignee: Mark Miller Fix For: 4.1, 5.0 Doesn't affect SolrCloud since it uses the 'force' option, but a regression in Solr 4.0 from 3X it appears. See http://lucene.472066.n3.nabble.com/Solr-4-0-SnapPuller-version-vs-generation-issue-td4032347.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-4.x-Windows (64bit/jdk1.7.0_10) - Build # 2397 - Failure!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Windows/2397/ Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC All tests passed Build Log: [...truncated 25765 lines...] -documentation-lint: [echo] checking for broken html... [jtidy] Checking for broken html (such as invalid tags)... [delete] Deleting directory C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\build\jtidy_tmp [echo] Checking for broken links... [exec] [exec] Crawl/parse... [exec] [exec] Verify... [exec] [exec] file:///build/docs/core/org/apache/lucene/analysis/package-summary.html [exec] BAD EXTERNAL LINK: http://lucene.apache.org/core/discussion.html [exec] [exec] Broken javadocs links were found! BUILD FAILED C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\build.xml:60: The following error occurred while executing this line: C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\build.xml:242: The following error occurred while executing this line: C:\Users\JenkinsSlave\workspace\Lucene-Solr-4.x-Windows\lucene\common-build.xml:1960: exec returned: 1 Total time: 64 minutes 10 seconds Build step 'Invoke Ant' marked build as failure Archiving artifacts Recording test results Description set: Java: 64bit/jdk1.7.0_10 -XX:+UseG1GC Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present
I just tried, and it causes an NPE, kind of as I had expected. I’ll file the Jira. -- Jack Krupansky From: Dyer, James Sent: Monday, January 14, 2013 10:50 AM To: dev@lucene.apache.org Subject: RE: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present Jack, Did you test this to see if you could trigger this bug? But in any case, can you open a jira ticket so this won't fall under the radar? Even if the comment that was put here is true I guess we should minimally throw an exception, or use the first one and log a warning, maybe? James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Sunday, January 13, 2013 1:24 PM To: Lucene/Solr Dev Subject: Possible bug in Solr SpellCheckComponent if more than one QueryConverter class is present Reading through the code for Solr SpellCheckComponent.java for 4.1, it looks like it neither complains nor defaults reasonably if more than one QueryConverter class is present in the Solr lib directories: Map<String, QueryConverter> queryConverters = new HashMap<String, QueryConverter>(); core.initPlugins(queryConverters,QueryConverter.class); //ensure that there is at least one query converter defined if (queryConverters.size() == 0) { LOG.info("No queryConverter defined, using default converter"); queryConverters.put("queryConverter", new SpellingQueryConverter()); } //there should only be one if (queryConverters.size() == 1) { queryConverter = queryConverters.values().iterator().next(); IndexSchema schema = core.getSchema(); String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType"); FieldType fieldType = schema.getFieldTypes().get(fieldTypeName); Analyzer analyzer = fieldType == null ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion) : fieldType.getQueryAnalyzer(); //TODO: There's got to be a better way! Where's Spring when you need it? queryConverter.setAnalyzer(analyzer); } No else! And queryConverter is not initialized, except for that code path where there was zero or one QueryConverter class. -- Jack Krupansky
[jira] [Created] (LUCENE-4684) Allow DirectSpellChecker to be extended
Martijn van Groningen created LUCENE-4684: - Summary: Allow DirectSpellChecker to be extended Key: LUCENE-4684 URL: https://issues.apache.org/jira/browse/LUCENE-4684 Project: Lucene - Core Issue Type: Improvement Components: modules/spellchecker Environment: Currently the suggestSimilar() that actually operates on the FuzzyTermy is private protected. Would be great if that would just be protected for extensions. Reporter: Martijn van Groningen Assignee: Martijn van Groningen Priority: Minor -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1433005 - in /lucene/dev/branches/branch_4x: ./ dev-tools/ dev-tools/scripts/checkJavadocLinks.py
Thanks Robert. - Steve

On Jan 14, 2013, at 12:41 PM, rm...@apache.org wrote:

Author: rmuir
Date: Mon Jan 14 17:41:01 2013
New Revision: 1433005

URL: http://svn.apache.org/viewvc?rev=1433005&view=rev
Log: whitelist this link

Modified:
    lucene/dev/branches/branch_4x/   (props changed)
    lucene/dev/branches/branch_4x/dev-tools/   (props changed)
    lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py

Modified: lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py?rev=1433005&r1=1433004&r2=1433005&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py (original)
+++ lucene/dev/branches/branch_4x/dev-tools/scripts/checkJavadocLinks.py Mon Jan 14 17:41:01 2013
@@ -197,6 +197,9 @@ def checkAll(dirName):
    elif link.find('lucene.apache.org/java/docs/discussion.html') != -1:
      # OK
      pass
+    elif link.find('lucene.apache.org/core/discussion.html') != -1:
+      # OK
+      pass
    elif link.find('lucene.apache.org/solr/mirrors-solr-latest-redir.html') != -1:
      # OK
      pass

- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4684) Allow DirectSpellChecker to be extended
[ https://issues.apache.org/jira/browse/LUCENE-4684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martijn van Groningen updated LUCENE-4684: -- Attachment: LUCENE-4684.patch Made all the fields of DirectSpellChecker protected, the suggestSimilar method and the ScoreTerm inner class. Allow DirectSpellChecker to be extended Key: LUCENE-4684 URL: https://issues.apache.org/jira/browse/LUCENE-4684 Project: Lucene - Core Issue Type: Improvement Components: modules/spellchecker Environment: Currently the suggestSimilar() that actually operates on the FuzzyTermy is private protected. Would be great if that would just be protected for extensions. Reporter: Martijn van Groningen Assignee: Martijn van Groningen Priority: Minor Attachments: LUCENE-4684.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2765) Optimize scanning in DocsEnum
[ https://issues.apache.org/jira/browse/LUCENE-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2765: --- Fix Version/s: (was: 4.1) 4.2 Optimize scanning in DocsEnum - Key: LUCENE-2765 URL: https://issues.apache.org/jira/browse/LUCENE-2765 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: LUCENE-2765.patch, LUCENE-2765.patch Similar to LUCENE-2761: when we call advance(), after skipping it scans, but this can be optimized better than calling nextDoc() like today {noformat} // scan for the rest: do { nextDoc(); } while (target > doc); {noformat} in particular, the freq can be skipVinted and the skipDocs (deletedDocs) don't need to be checked during this scanning. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
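To make the proposed optimization concrete, here is a toy sketch of the scan loop. It is not the actual codec code: it assumes a simple interleaved docDelta/freq vInt layout, and the IndexInput / doc-count parameters are hypothetical. The point is that during the post-skip scan the doc deltas still have to be decoded, but each freq vInt can be skipped byte-by-byte without decoding its value, and deleted docs need not be checked because every doc below the target is discarded anyway.

{code}
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

final class ScanSketch {
  /** Scan forward through interleaved (docDelta, freq) vInts until doc >= target. */
  static int scanTo(IndexInput in, int currentDoc, int remainingDocs, int target) throws IOException {
    int doc = currentDoc;
    while (doc < target && remainingDocs > 0) {
      doc += in.readVInt();   // the doc delta must be decoded to advance
      skipVInt(in);           // the freq: consume its bytes without decoding
      remainingDocs--;
      // note: no deleted-docs check here -- docs below target are thrown away regardless
    }
    return doc >= target ? doc : Integer.MAX_VALUE;  // MAX_VALUE = list exhausted
  }

  private static void skipVInt(IndexInput in) throws IOException {
    while ((in.readByte() & 0x80) != 0) {
      // continuation bit set: keep consuming bytes
    }
  }
}
{code}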
[jira] [Updated] (LUCENE-2832) on Windows 64-bit, maybe we should default to a better maxBBufSize in MMapDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2832: --- Fix Version/s: (was: 4.1) 4.2 on Windows 64-bit, maybe we should default to a better maxBBufSize in MMapDirectory --- Key: LUCENE-2832 URL: https://issues.apache.org/jira/browse/LUCENE-2832 Project: Lucene - Core Issue Type: Improvement Components: core/store Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: LUCENE-2832.patch Currently the default max buffer size for MMapDirectory is 256MB on 32bit and Integer.MAX_VALUE on 64bit: {noformat} public static final int DEFAULT_MAX_BUFF = Constants.JRE_IS_64BIT ? Integer.MAX_VALUE : (256 * 1024 * 1024); {noformat} But, in windows on 64-bit, you are practically limited to 8TB. This can cause problems in extreme cases, such as: http://www.lucidimagination.com/search/document/7522ee54c46f9ca4/map_failed_at_getsearcher Perhaps it would be good to change this default such that its 256MB on 32Bit *OR* windows, but leave it at Integer.MAX_VALUE on other 64-bit and 64-bit (48-bit) systems. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
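As a usage note on the default discussed above: the chunk size is already controllable per directory, so a deployment that runs into address-space pressure can cap it explicitly instead of relying on the platform default. A minimal sketch against the 4.x constructor (the index path is just a placeholder; passing null uses the default LockFactory):

{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.MMapDirectory;

public class MMapChunkSizeExample {
  public static void main(String[] args) throws IOException {
    // Cap each mapped chunk at 256 MB rather than Integer.MAX_VALUE.
    MMapDirectory dir = new MMapDirectory(new File("/path/to/index"), null, 256 * 1024 * 1024);
    System.out.println("max chunk size: " + dir.getMaxChunkSize());
    dir.close();
  }
}
{code}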
[jira] [Updated] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shalin Shekhar Mangar updated SOLR-4016: Attachment: SOLR-4016-disallow-partial-update.patch Patch with a better test. Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt:
{noformat}
<updateRequestProcessorChain name="text_hash">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">text_hash</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
{noformat}
Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone:
{noformat}
$ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world}}}}' && curl '$URL/select?q=id:abcde'
{responseHeader:{status:0,QTime:30}}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">ad48c7ad60ac22cc</str><long name="_version_">1417247434224959488</long></doc></result>
</response>
$
$ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' && curl '$URL/select?q=id:abcde'
{responseHeader:{status:0,QTime:27}}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash">b169c743d220da8d</str><long name="_version_">141724802221564</long></doc></result>
</response>
{noformat}
Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values:
{noformat}
$ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title}}}}' && curl '$URL/select?q=id:abcde'
{responseHeader:{status:0,QTime:39}}
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">id:abcde</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">abcde</str><str name="text">hello world</str><str name="text_hash"></str><str name="title">new title</str><long name="_version_">1417248120480202752</long></doc></result>
</response>
{noformat}
-- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2570) randomize indexwriter settings in solr tests
[ https://issues.apache.org/jira/browse/SOLR-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated SOLR-2570: - Fix Version/s: (was: 4.1) 4.2 randomize indexwriter settings in solr tests Key: SOLR-2570 URL: https://issues.apache.org/jira/browse/SOLR-2570 Project: Solr Issue Type: Test Components: Build Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: SOLR-2570.patch we should randomize indexwriter settings like lucene tests do, to vary # of segments and such. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-1674) improve analysis tests, cut over to new API
[ https://issues.apache.org/jira/browse/SOLR-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated SOLR-1674: - Fix Version/s: (was: 4.1) 4.2 improve analysis tests, cut over to new API --- Key: SOLR-1674 URL: https://issues.apache.org/jira/browse/SOLR-1674 Project: Solr Issue Type: Test Components: Schema and Analysis Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.2 Attachments: SOLR-1674.patch, SOLR-1674.patch, SOLR-1674_speedup.patch This patch * converts all analysis tests to use the new tokenstream api * converts most tests to use the more stringent assertion mechanisms from lucene * adds new tests to improve coverage Most bugs found by more stringent testing have been fixed, with the exception of SynonymFilter. The problems with this filter are more serious, the previous tests were essentially a no-op. The new tests for SynonymFilter test the current behavior, but have FIXMEs with what I think the old test wanted to expect in the comments. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3459) Change ChainedFilter to use FixedBitSet
[ https://issues.apache.org/jira/browse/LUCENE-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3459: --- Fix Version/s: (was: 4.1) 4.2 Change ChainedFilter to use FixedBitSet --- Key: LUCENE-3459 URL: https://issues.apache.org/jira/browse/LUCENE-3459 Project: Lucene - Core Issue Type: Task Components: modules/other Affects Versions: 3.4, 4.0-ALPHA Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.2 ChainedFilter also uses OpenBitSet(DISI) at the moment. It should also be changed to use FixedBitSet. There are two issues: - It exposes sometimes OpenBitSetDISI to it's public API - we should remove those methods like in BooleanFilter and break backwards - It allows a XOR operation. This is not yet supported by FixedBitSet, but it's easy to add (like for BooleanFilter). On the other hand, this XOR operation is bogus, as it may mark documents in the BitSet that are deleted, breaking new features like applying Filters down-low (LUCENE-1536). We should remove the XOR operation maybe or force it to use IR.validDocs() (trunk) or IR.isDeleted() -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
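For reference, the XOR described above as easy to add would amount to roughly a word-wise pass over the two backing long[] arrays, in the same style as the existing and()/or()/andNot() operations. A minimal sketch (not the actual patch; it assumes both sets were sized for the same reader and silently ignores bits beyond the shorter backing array):

{code}
import org.apache.lucene.util.FixedBitSet;

final class XorSketch {
  /** In-place symmetric difference: flips in dest every bit that is set in other. */
  static void xor(FixedBitSet dest, FixedBitSet other) {
    long[] destBits = dest.getBits();
    long[] otherBits = other.getBits();
    int numWords = Math.min(destBits.length, otherBits.length);
    for (int i = 0; i < numWords; i++) {
      destBits[i] ^= otherBits[i];   // 64 documents per long word
    }
  }
}
{code}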
[jira] [Updated] (LUCENE-3034) If you vary a setting per round and that setting is a long string, the report padding/columns break down.
[ https://issues.apache.org/jira/browse/LUCENE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3034: --- Fix Version/s: (was: 4.1) 4.2 If you vary a setting per round and that setting is a long string, the report padding/columns break down. - Key: LUCENE-3034 URL: https://issues.apache.org/jira/browse/LUCENE-3034 Project: Lucene - Core Issue Type: Improvement Components: modules/benchmark Reporter: Mark Miller Assignee: Mark Miller Priority: Trivial Fix For: 4.2 This is especially noticeable if you vary a setting where the value is a fully specified class name - in this case, it would be nice if columns in each row still lined up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3451) Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3451: --- Fix Version/s: (was: 4.1) 4.2 Remove special handling of pure negative Filters in BooleanFilter, disallow pure negative queries in BooleanQuery - Key: LUCENE-3451 URL: https://issues.apache.org/jira/browse/LUCENE-3451 Project: Lucene - Core Issue Type: Improvement Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.2 Attachments: LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch, LUCENE-3451.patch We should at least in Lucene 4.0 remove the hack in BooleanFilter that allows pure negative Filter clauses. This is not supported by BooleanQuery and confuses users (I think that's the problem in LUCENE-3450). The hack is buggy, as it does not respect deleted documents and returns them in its DocIdSet. Also we should think about disallowing pure-negative Queries at all and throw UOE. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3968) Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens
[ https://issues.apache.org/jira/browse/LUCENE-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3968: --- Fix Version/s: (was: 4.1) 4.2 Factor MockGraphTokenFilter into LookaheadTokenFilter + random tokens - Key: LUCENE-3968 URL: https://issues.apache.org/jira/browse/LUCENE-3968 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.2 Attachments: LUCENE-3968.patch MockGraphTokenFilter is rather hairy... I've managed to simplify it (I think!) by breaking apart its two functions... I think LookaheadTokenFilter can be used in the future for other graph aware filters. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552935#comment-13552935 ] Commit Tag Bot commented on SOLR-4016: -- [trunk commit] Shalin Shekhar Mangar http://svn.apache.org/viewvc?view=revisionrevision=1433013 SOLR-4016: Deduplication does not work with atomic/partial updates so disallow atomic update requests which change signature generating fields. Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (SOLR-4304) NPE in Solr SpellCheckComponent if more than one QueryConverter
Jack Krupansky created SOLR-4304: Summary: NPE in Solr SpellCheckComponent if more than one QueryConverter Key: SOLR-4304 URL: https://issues.apache.org/jira/browse/SOLR-4304 Project: Solr Issue Type: Bug Components: spellchecker Affects Versions: 4.0 Reporter: Jack Krupansky The Solr SpellCheckComponent uses only a single QueryConverter, but fails with an NPE if more than one QueryConverter class is registered in solrconfig.xml. Repro: 1. Add to 4.0 example solrconfig.xml: queryConverter name=myQueryConverter-1 class=solr.SpellingQueryConverter/ queryConverter name=myQueryConverter-2 class=solr.SuggestQueryConverter/ 2. Perform a spellcheck request: curl http://localhost:8983/solr/spell?q=testindent=true; 3. Examine the NPE: ?xml version=1.0 encoding=UTF-8? response lst name=responseHeader int name=status500/int int name=QTime4/int /lst result name=response numFound=0 start=0 /result lst name=error str name=tracejava.lang.NullPointerException at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:136) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) at org.eclipse.jetty.server.Server.handle(Server.java:351) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599) at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534) at java.lang.Thread.run(Unknown Source) /str int name=code500/int /lst /response Suggested resolution: Use the first QueryConverter, but give a warning that indicates the class name of the one being used. Alternatively, throw a nasty but informative exception indicating the true nature of the problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3424) Return sequence ids from IW update/delete/add/commit to allow total ordering outside of IW
[ https://issues.apache.org/jira/browse/LUCENE-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3424: --- Fix Version/s: (was: 4.1) 4.2 Return sequence ids from IW update/delete/add/commit to allow total ordering outside of IW -- Key: LUCENE-3424 URL: https://issues.apache.org/jira/browse/LUCENE-3424 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.2 Attachments: LUCENE-3424.patch Based on the discussion on the [mailing list|http://mail-archives.apache.org/mod_mbox/lucene-dev/201109.mbox/%3CCAAHmpki-h7LUZGCUX_rfFx=q5-YkLJei+piRG=oic8d1pnr...@mail.gmail.com%3E] IW should return sequence ids from update/delete/add and commit to allow ordering of events for consistent transaction logs and recovery. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552950#comment-13552950 ] Yonik Seeley commented on SOLR-4016: bq. I see why you suggested that. The signature is like a unique key and modifying it seems like a rare use-case. But, if we do go that way, we should throw an exception and explicitly disallow partial update of signature generating fields. There are different use-cases here. If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates). Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? 
response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3069: --- Fix Version/s: (was: 4.1) 4.2 Lucene should have an entirely memory resident term dictionary -- Key: LUCENE-3069 URL: https://issues.apache.org/jira/browse/LUCENE-3069 Project: Lucene - Core Issue Type: Improvement Components: core/index, core/search Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Labels: gsoc2012, lucene-gsoc-12 Fix For: 4.2 FST based TermDictionary has been a great improvement yet it still uses a delta codec file for scanning to terms. Some environments have enough memory available to keep the entire FST based term dict in memory. We should add a TermDictionary implementation that encodes all needed information for each term into the FST (custom fst.Output) and builds a FST from the entire term not just the delta. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
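To make the idea of encoding everything for a term into the FST concrete, here is a small self-contained sketch that builds an in-memory FST mapping whole terms to a long output (standing in for whatever per-term metadata a term dictionary would store) and then probes it. It is illustrative only; exact signatures such as PositiveIntOutputs.getSingleton and Util.toIntsRef have shifted between 4.x releases.

{code}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstTermDictSketch {
  public static void main(String[] args) throws Exception {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();

    // Terms must be added in sorted order; the long output stands in for
    // per-term metadata (file pointer, docFreq, ...).
    builder.add(Util.toIntsRef(new BytesRef("cat"), scratch), 7L);
    builder.add(Util.toIntsRef(new BytesRef("dog"), scratch), 12L);
    builder.add(Util.toIntsRef(new BytesRef("dogs"), scratch), 13L);
    FST<Long> fst = builder.finish();

    // Entirely memory-resident exact-term lookup: walk the FST arcs.
    Long value = Util.get(fst, new BytesRef("dog"));
    System.out.println("dog -> " + value);   // prints 12
  }
}
{code}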
[jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
[ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3022: --- Fix Version/s: (was: 4.1) 4.2 DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect - Key: LUCENE-3022 URL: https://issues.apache.org/jira/browse/LUCENE-3022 Project: Lucene - Core Issue Type: Bug Components: modules/analysis Affects Versions: 2.9.4, 3.1 Reporter: Johann Höchtl Assignee: Robert Muir Priority: Minor Fix For: 4.2 Attachments: LUCENE-3022.patch, LUCENE-3022.patch Original Estimate: 5m Remaining Estimate: 5m When using the DictionaryCompoundWordTokenFilter with a german dictionary, I got a strange behaviour: The german word streifenbluse (blouse with stripes) was decompounded to streifen (stripe), reifen (tire), which makes no sense at all. I thought the flag onlyLongestMatch would fix this, because streifen is longer than reifen, but it had no effect. So I reviewed the sourcecode and found the problem:
[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.length() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.buffer());
  for (int i = 0; i < token.length() - this.minSubwordSize; ++i) {
    Token longestMatchToken = null;
    for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
      if (i + j > token.length()) {
        break;
      }
      if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken != null) {
            if (longestMatchToken.length() < j) {
              longestMatchToken = createToken(i, j, token);
            }
          } else {
            longestMatchToken = createToken(i, j, token);
          }
        } else {
          tokens.add(createToken(i, j, token));
        }
      }
    }
    if (this.onlyLongestMatch && longestMatchToken != null) {
      tokens.add(longestMatchToken);
    }
  }
}
[/code]
should be changed to
[code]
protected void decomposeInternal(final Token token) {
  // Only words longer than minWordSize get processed
  if (token.termLength() < this.minWordSize) {
    return;
  }
  char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.termBuffer());
  Token longestMatchToken = null;
  for (int i = 0; i < token.termLength() - this.minSubwordSize; ++i) {
    for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
      if (i + j > token.termLength()) {
        break;
      }
      if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
        if (this.onlyLongestMatch) {
          if (longestMatchToken != null) {
            if (longestMatchToken.termLength() < j) {
              longestMatchToken = createToken(i, j, token);
            }
          } else {
            longestMatchToken = createToken(i, j, token);
          }
        } else {
          tokens.add(createToken(i, j, token));
        }
      }
    }
  }
  if (this.onlyLongestMatch && longestMatchToken != null) {
    tokens.add(longestMatchToken);
  }
}
[/code]
So, that only the longest token is really indexed and the onlyLongestMatch Flag makes sense. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3797) 3xCodec should throw UOE if a DocValuesConsumer is pulled
[ https://issues.apache.org/jira/browse/LUCENE-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3797: --- Fix Version/s: (was: 4.1) 4.2 3xCodec should throw UOE if a DocValuesConsumer is pulled -- Key: LUCENE-3797 URL: https://issues.apache.org/jira/browse/LUCENE-3797 Project: Lucene - Core Issue Type: Improvement Components: core/codecs, core/index Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Fix For: 4.2 Attachments: LUCENE-3797.patch, LUCENE-3797.patch currently we just return null if a DVConsumer is pulled from 3.x which is trappy since it causes an NPE in DocFieldProcessor. We should rather throw a UOE. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-3252) Use single array in fixed straight bytes DocValues if possible
[ https://issues.apache.org/jira/browse/LUCENE-3252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-3252: --- Fix Version/s: (was: 4.1) 4.2 Use single array in fixed straight bytes DocValues if possible -- Key: LUCENE-3252 URL: https://issues.apache.org/jira/browse/LUCENE-3252 Project: Lucene - Core Issue Type: Improvement Components: core/search, core/store Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.2 Attachments: LUCENE-3252.patch FixedStraightBytesImpl currently uses a straight array only if the byte size is 1 per document we could further optimize this to use a single array if all the values fit in. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1921) Absurdly large radius (miles) search fails to include entire earth
[ https://issues.apache.org/jira/browse/LUCENE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-1921: --- Fix Version/s: (was: 4.1) 4.2 Absurdly large radius (miles) search fails to include entire earth -- Key: LUCENE-1921 URL: https://issues.apache.org/jira/browse/LUCENE-1921 Project: Lucene - Core Issue Type: Bug Components: modules/spatial Affects Versions: 2.9 Reporter: Michael McCandless Assignee: Chris Male Priority: Minor Fix For: 4.2 Attachments: ASF.LICENSE.NOT.GRANTED--TEST-1921.patch Spinoff from LUCENE-1781. If you do a very large (eg 10 miles) radius search then the lat/lng bound box wraps around the entire earth and all points should be accepted. But this fails today (many points are rejected). It's easy to see the issue: edit TestCartesian, and insert a very large miles into either testRange or testGeoHashRange. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1710) Add byte/short to NumericUtils, NumericField and NumericRangeQuery
[ https://issues.apache.org/jira/browse/LUCENE-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-1710: --- Fix Version/s: (was: 4.1) 4.2 Add byte/short to NumericUtils, NumericField and NumericRangeQuery -- Key: LUCENE-1710 URL: https://issues.apache.org/jira/browse/LUCENE-1710 Project: Lucene - Core Issue Type: New Feature Components: core/index, core/search Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 4.2 Although NumericRangeQuery will not profit much from trie-encoding short/byte fields (byte fields with e.g. precisionStep 8 would only create one precision), it may be good to have these two data types available with NumericField to be generally able to store them in prefix-encoded form in index. This is important for loading them into FieldCache where they require much less memory. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2527) FieldCache.getTermsIndex should cache fasterButMoreRAM=true|false to the same cache key
[ https://issues.apache.org/jira/browse/LUCENE-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2527: --- Fix Version/s: (was: 4.1) 4.2 FieldCache.getTermsIndex should cache fasterButMoreRAM=true|false to the same cache key --- Key: LUCENE-2527 URL: https://issues.apache.org/jira/browse/LUCENE-2527 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 4.0-ALPHA Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.2 When we cutover FieldCache to use shared byte[] blocks, we added the boolean fasterButMoreRAM option, so you could tradeoff time/space. It defaults to true. The thinking is that an expert user, who wants to use false, could pre-populate FieldCache by loading the field with false, and then later when sorting on that field it'd use that same entry. But there's a bug -- when sorting, it then loads a 2nd entry with true. This is because the Entry.custom in FieldCache participates in equals/hashCode. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1820) WildcardQueryNode to expose the positions of the wildcard characters, for easier use in processors and builders
[ https://issues.apache.org/jira/browse/LUCENE-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-1820: --- Fix Version/s: (was: 4.1) 4.2 WildcardQueryNode to expose the positions of the wildcard characters, for easier use in processors and builders --- Key: LUCENE-1820 URL: https://issues.apache.org/jira/browse/LUCENE-1820 Project: Lucene - Core Issue Type: Improvement Components: core/queryparser Reporter: Luis Alves Assignee: Michael Busch Priority: Minor Fix For: 4.2 Change the WildcardQueryNode to expose the positions of the wildcard characters. This would allow the AllowLeadingWildcardProcessor not to have to knowledge about the wildcard chars * and ? and avoid double check again. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2735) First Cut at GroupVarInt with FixedIntBlockIndexInput / Output
[ https://issues.apache.org/jira/browse/LUCENE-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2735: --- Fix Version/s: (was: 4.1) 4.2 First Cut at GroupVarInt with FixedIntBlockIndexInput / Output -- Key: LUCENE-2735 URL: https://issues.apache.org/jira/browse/LUCENE-2735 Project: Lucene - Core Issue Type: Improvement Components: core/index Affects Versions: 4.0-ALPHA Reporter: Simon Willnauer Assignee: Simon Willnauer Priority: Minor Fix For: 4.2 Attachments: LUCENE-2735_alt.patch, LUCENE-2735.patch, LUCENE-2735.patch, LUCENE-2735.patch I have hacked together a FixedIntBlockIndex impl with Group VarInt encoding - this does way worse than standard codec in benchmarks but I guess that is mainly due to the FixedIntBlockIndex limitations. Once LUCENE-2723 is in / or builds with trunk again I will update and run some tests. The isolated microbenchmark shows that there could be improvements over vint even in java though and I am sure we can make it faster impl. wise. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
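For readers unfamiliar with the encoding being benchmarked above: group varint packs four ints behind a single tag byte whose 2-bit fields record how many bytes (1-4) each value occupies, so a decoder branches on one tag byte per group instead of one continuation bit per byte as with vInt. A minimal standalone sketch of the encoder side (illustrative only, not the attached patch):

{code}
final class GroupVarIntSketch {
  /**
   * Encode values[offset..offset+4) into out starting at outPos.
   * Layout: one tag byte (2 bits per value = numBytes - 1), then each value
   * little-endian using only the bytes it needs. Returns the new write position.
   */
  static int encodeGroup(int[] values, int offset, byte[] out, int outPos) {
    int tagPos = outPos++;               // reserve room for the tag byte
    int tag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[offset + i];
      int numBytes = 1;
      if ((v >>> 8) != 0)  numBytes = 2;
      if ((v >>> 16) != 0) numBytes = 3;
      if ((v >>> 24) != 0) numBytes = 4;
      tag |= (numBytes - 1) << (i * 2);  // record this value's width in the tag
      for (int b = 0; b < numBytes; b++) {
        out[outPos++] = (byte) (v >>> (8 * b));
      }
    }
    out[tagPos] = (byte) tag;
    return outPos;
  }
}
{code}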
[jira] [Commented] (SOLR-4016) Deduplication is broken by partial update
[ https://issues.apache.org/jira/browse/SOLR-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552967#comment-13552967 ] Shalin Shekhar Mangar commented on SOLR-4016: - bq. If the signature being generated was the unique key, then atomic updates should be able to proceed fine as long as the id field is specified (as should always be the case with atomic updates). The patch that I committed throws an exception if an atomic update request contains fields that are used to compute the signature. An atomic update request which does not modify the signature, proceeds as normal. This way we make sure that a document never contains a wrong signature. Do you agree that this is an acceptable compromise until a proper fix is in place? Deduplication is broken by partial update - Key: SOLR-4016 URL: https://issues.apache.org/jira/browse/SOLR-4016 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.0 Environment: Tomcat6 / Catalina on Ubuntu 12.04 LTS Reporter: Joel Nothman Assignee: Shalin Shekhar Mangar Labels: 4.0.1_Candidate Fix For: 4.1, 5.0 Attachments: SOLR-4016-disallow-partial-update.patch, SOLR-4016-disallow-partial-update.patch, SOLR-4016.patch The SignatureUpdateProcessorFactory used (primarily?) for deduplication does not consider partial update semantics. The below uses the following solrconfig.xml excerpt: {noformat} updateRequestProcessorChain name=text_hash processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldtext_hash/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClasssolr.processor.TextProfileSignature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain {noformat} Firstly, the processor treats {noformat}{set: value}{noformat} as a string and hashes it, instead of the value alone: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: {set: hello world' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:30}} ?xml version=1.0 encoding=UTF-8?responselst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashad48c7ad60ac22cc/strlong name=_version_1417247434224959488/long/doc/result /response $ $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, text: hello world}}}' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:27}} ?xml version=1.0 encoding=UTF-8? response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hashb169c743d220da8d/strlong name=_version_141724802221564/long/doc/result /response {noformat} Note the different text_hash value. Secondly, when updating a field other than those used to create the signature (which I imagine is a more common use-case), the signature is recalculated from no values: {noformat} $ curl '$URL/update?commit=true' -H 'Content-type:application/json' -d '{add:{doc:{id: abcde, title: {set: new title' curl '$URL/select?q=id:abcde' {responseHeader:{status:0,QTime:39}} ?xml version=1.0 encoding=UTF-8? 
response lst name=responseHeaderint name=status0/intint name=QTime1/intlst name=paramsstr name=qid:abcde/str/lst/lstresult name=response numFound=1 start=0docstr name=idabcde/strstr name=texthello world/strstr name=text_hash/strstr name=titlenew title/strlong name=_version_1417248120480202752/long/doc/result /response {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: 4.1 branch
branches/lucene_solr_4_1/ is open for business! I'm going to change version strings in branch_4x from 4.1 to 4.2 now. Steve On Jan 14, 2013, at 10:13 AM, Steve Rowe sar...@gmail.com wrote: For anyone with pending patches: I plan on branching for 4.1 at around 1:00pm US EST. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3298) FST has hard limit max size of 2.1 GB
[ https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552968#comment-13552968 ] Commit Tag Bot commented on LUCENE-3298: [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revisionrevision=1433026 LUCENE-3298: FSTs can now be larger than 2GB, have more than 2B nodes FST has hard limit max size of 2.1 GB - Key: LUCENE-3298 URL: https://issues.apache.org/jira/browse/LUCENE-3298 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch, LUCENE-3298.patch The FST uses a single contiguous byte[] under the hood, which in java is indexed by int so we cannot grow this over Integer.MAX_VALUE. It also internally encodes references to this array as vInt. We could switch this to a paged byte[] and make the far larger. But I think this is low priority... I'm not going to work on it any time soon. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-4304) NPE in Solr SpellCheckComponent if more than one QueryConverter
[ https://issues.apache.org/jira/browse/SOLR-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552969#comment-13552969 ] Jack Krupansky commented on SOLR-4304: -- The issue is in this code:
{code}
Map<String, QueryConverter> queryConverters = new HashMap<String, QueryConverter>();
core.initPlugins(queryConverters, QueryConverter.class);
//ensure that there is at least one query converter defined
if (queryConverters.size() == 0) {
  LOG.info("No queryConverter defined, using default converter");
  queryConverters.put("queryConverter", new SpellingQueryConverter());
}
//there should only be one
if (queryConverters.size() == 1) {
  queryConverter = queryConverters.values().iterator().next();
  IndexSchema schema = core.getSchema();
  String fieldTypeName = (String) initParams.get("queryAnalyzerFieldType");
  FieldType fieldType = schema.getFieldTypes().get(fieldTypeName);
  Analyzer analyzer = fieldType == null ? new WhitespaceAnalyzer(core.getSolrConfig().luceneMatchVersion)
                                        : fieldType.getQueryAnalyzer();
  //TODO: There's got to be a better way! Where's Spring when you need it?
  queryConverter.setAnalyzer(analyzer);
}
{code}
No else! And queryConverter is not initialized, except for that code path where there was zero or one QueryConverter class. NPE in Solr SpellCheckComponent if more than one QueryConverter --- Key: SOLR-4304 URL: https://issues.apache.org/jira/browse/SOLR-4304 Project: Solr Issue Type: Bug Components: spellchecker Affects Versions: 4.0 Reporter: Jack Krupansky The Solr SpellCheckComponent uses only a single QueryConverter, but fails with an NPE if more than one QueryConverter class is registered in solrconfig.xml. Repro: 1. Add to 4.0 example solrconfig.xml: <queryConverter name="myQueryConverter-1" class="solr.SpellingQueryConverter"/> <queryConverter name="myQueryConverter-2" class="solr.SuggestQueryConverter"/> 2. Perform a spellcheck request: curl "http://localhost:8983/solr/spell?q=test&indent=true" 3. Examine the NPE: <?xml version="1.0" encoding="UTF-8"?>
response lst name=responseHeader int name=status500/int int name=QTime4/int /lst result name=response numFound=0 start=0 /result lst name=error str name=tracejava.lang.NullPointerException at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:136) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) at org.eclipse.jetty.server.Server.handle(Server.java:351) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890) at
[jira] [Updated] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-4620: --- Attachment: LUCENE-4620.patch Patch, fixing that bug Shai found. Performance is better with this specialization:
{noformat}
            Task    QPS base      StdDev    QPS comp      StdDev      Pct diff
        PKLookup      192.61      (4.5%)      193.06      (4.2%)      0.2% (  -8% -    9%)
         LowTerm       15.33      (1.6%)       15.44      (2.5%)      0.7% (  -3% -    4%)
         MedTerm        7.60      (0.7%)        7.74      (1.8%)      1.9% (   0% -    4%)
        HighTerm        3.85      (0.6%)        3.97      (1.2%)      3.1% (   1% -    4%)
{noformat}
I also tried the unrolling of the vInt loop but perf was strangely quite a bit worse.. Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer can be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so may decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4620) Explore IntEncoder/Decoder bulk API
[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552978#comment-13552978 ] Michael McCandless commented on LUCENE-4620: I think we should just make a specialized accumulator/aggregator for the counts-only-dgap-vint case: that code wouldn't need to populate an IntsRef and then make a 2nd pass over the ords ... it'd just increment the count for each ord as it decodes. In previous issues I already tested that this gives a good gain ... Explore IntEncoder/Decoder bulk API --- Key: LUCENE-4620 URL: https://issues.apache.org/jira/browse/LUCENE-4620 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Shai Erera Assignee: Shai Erera Fix For: 4.1, 5.0 Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) and decode(int). Originally, we believed that this layer could be useful for other scenarios, but in practice it's used only for writing/reading the category ordinals from payload/DV. Therefore, Mike and I would like to explore a bulk API, something like encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder can still be streaming (as we don't know in advance how many ints will be written), dunno. Will figure this out as we go. One thing to check is whether the bulk API can work w/ e.g. facet associations, which can write arbitrary byte[], and so maybe decoding to an IntsRef won't make sense. This too we'll figure out as we go. I don't rule out that associations will use a different bulk API. At the end of the day, the requirement is for someone to be able to configure how ordinals are written (i.e. different encoding schemes: VInt, PackedInts etc.) and later read, with as little overhead as possible. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
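For illustration, a sketch of the specialized counts-only-dgap-vint aggregation described in that comment, assuming a dense int[] counts array sized to the number of ordinals; the class and method names are hypothetical, not the eventual patch:
{code:java}
import org.apache.lucene.util.BytesRef;

// Sketch only: decode dgap+VInt ordinals straight out of the payload/DV bytes
// and bump the count per ordinal -- no intermediate IntsRef, no second pass.
public final class CountingDGapVIntAggregator {

  private final int[] counts;

  public CountingDGapVIntAggregator(int numOrdinals) {
    counts = new int[numOrdinals];
  }

  public void aggregate(BytesRef buf) {
    int upto = buf.offset;
    final int end = buf.offset + buf.length;
    int ord = 0;
    while (upto < end) {
      byte b = buf.bytes[upto++];
      int delta = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = buf.bytes[upto++];
        delta |= (b & 0x7F) << shift;
      }
      ord += delta;   // undo the gap
      counts[ord]++;  // count while decoding
    }
  }

  public int count(int ordinal) {
    return counts[ordinal];
  }
}
{code}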
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552994#comment-13552994 ] Michael McCandless commented on LUCENE-4676: This explanation makes perfect sense! Thanks for digging, Simon. +1 to just use NoMergePolicy. IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Attachments: LUCENE-4676.patch Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: 4.1 branch
On Jan 14, 2013, at 1:33 PM, Steve Rowe sar...@gmail.com wrote: branches/lucene_solr_4_1/ is open for business! I'm going to change version strings in branch_4x from 4.1 to 4.2 now. Done. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4676) IndexReader.isCurrent race
[ https://issues.apache.org/jira/browse/LUCENE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553044#comment-13553044 ] Commit Tag Bot commented on LUCENE-4676: [trunk commit] Simon Willnauer http://svn.apache.org/viewvc?view=revisionrevision=1433079 LUCENE-4676: Use NoMergePolicy in starvation test to prevent buffered deletes pruning IndexReader.isCurrent race -- Key: LUCENE-4676 URL: https://issues.apache.org/jira/browse/LUCENE-4676 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir Assignee: Simon Willnauer Fix For: 4.1 Attachments: LUCENE-4676.patch Revision: 1431169 ant test -Dtestcase=TestNRTManager -Dtests.method=testThreadStarvationNoDeleteNRTReader -Dtests.seed=925ECD106FBFA3FF -Dtests.slow=true -Dtests.locale=fr_CA -Dtests.timezone=America/Kentucky/Louisville -Dtests.file.encoding=US-ASCII -Dtests.dups=500 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
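As a sketch of the fix's mechanics, disabling merges on a 4.x IndexWriter looks roughly like this; it is illustrative only, not the committed test change, and the singleton names are the 4.x ones (later releases expose NoMergePolicy.INSTANCE instead):
{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch: disable merges entirely so background merging cannot apply or prune
// buffered deletes underneath the NRT reader the test is exercising.
public class NoMergePolicyExample {

  public static IndexWriter openWriter() throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriterConfig iwc =
        new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));
    iwc.setMergePolicy(NoMergePolicy.COMPOUND_FILES); // or NO_COMPOUND_FILES
    return new IndexWriter(dir, iwc);
  }
}
{code}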
[jira] [Commented] (SOLR-2592) Custom Hashing
[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553058#comment-13553058 ] Commit Tag Bot commented on SOLR-2592: -- [trunk commit] Yonik Seeley http://svn.apache.org/viewvc?view=revisionrevision=1433082 SOLR-2592: changes entry for doc routing Custom Hashing -- Key: SOLR-2592 URL: https://issues.apache.org/jira/browse/SOLR-2592 Project: Solr Issue Type: New Feature Components: SolrCloud Affects Versions: 4.0-ALPHA Reporter: Noble Paul Assignee: Yonik Seeley Fix For: 4.1 Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, SOLR-2592_collectionProperties.patch, SOLR-2592.patch, SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value, etc.), it will be easy to narrow down the search to a smaller subset of shards and, in effect, achieve more efficient search. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Resolved] (SOLR-2592) Custom Hashing
[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley resolved SOLR-2592. Resolution: Fixed Fix Version/s: 5.0 Custom Hashing -- Key: SOLR-2592 URL: https://issues.apache.org/jira/browse/SOLR-2592 Project: Solr Issue Type: New Feature Components: SolrCloud Affects Versions: 4.0-ALPHA Reporter: Noble Paul Assignee: Yonik Seeley Fix For: 4.1, 5.0 Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, SOLR-2592_collectionProperties.patch, SOLR-2592.patch, SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value, etc.), it will be easy to narrow down the search to a smaller subset of shards and, in effect, achieve more efficient search. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-2894) Implement distributed pivot faceting
[ https://issues.apache.org/jira/browse/SOLR-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Russell updated SOLR-2894: Attachment: SOLR-2894.patch Corrected null aggregation issues when docs contain null values for the fields being pivoted on. Added logic to remove local params from the pivot query-string variables when determining over-request. Implement distributed pivot faceting Key: SOLR-2894 URL: https://issues.apache.org/jira/browse/SOLR-2894 Project: Solr Issue Type: Improvement Reporter: Erik Hatcher Fix For: 4.2, 5.0 Attachments: SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894.patch, SOLR-2894-reworked.patch Following up on SOLR-792, pivot faceting currently only supports undistributed mode. Distributed pivot faceting needs to be implemented. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-3858) Doc-to-shard assignment based on range property on shards
[ https://issues.apache.org/jira/browse/SOLR-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553064#comment-13553064 ] Yonik Seeley commented on SOLR-3858: SOLR-3755 took care of most of this, but the shard splitting code still needs to use the collection-specific doc router. Doc-to-shard assignment based on range property on shards --- Key: SOLR-3858 URL: https://issues.apache.org/jira/browse/SOLR-3858 Project: Solr Issue Type: Sub-task Reporter: Yonik Seeley Anything that maps a document id to a shard should consult the ranges defined on the shards (currently indexing and real-time get). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
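The doc-to-shard rule described here (hash the document id, then consult the per-shard hash ranges) can be sketched as follows; this is illustrative only and not Solr's actual DocRouter code, which uses MurmurHash3 rather than String.hashCode:
{code:java}
import java.util.List;

// Illustrative sketch of range-based doc routing: find the shard whose
// [min, max] hash range covers the hash of the document id.
public class HashRangeRouterSketch {

  public static class ShardRange {
    final String shardName;
    final int min;
    final int max;

    public ShardRange(String shardName, int min, int max) {
      this.shardName = shardName;
      this.min = min;
      this.max = max;
    }
  }

  public static String shardFor(String docId, List<ShardRange> ranges) {
    int hash = docId.hashCode(); // stand-in hash; Solr uses MurmurHash3 here
    for (ShardRange r : ranges) {
      if (hash >= r.min && hash <= r.max) {
        return r.shardName;
      }
    }
    throw new IllegalStateException("no shard range covers hash " + hash);
  }
}
{code}
Indexing, deletes, and real-time get would all go through the same lookup, so a document consistently lands on (and is fetched from) the shard that owns its hash.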
[jira] [Commented] (SOLR-2592) Custom Hashing
[ https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13553066#comment-13553066 ] Commit Tag Bot commented on SOLR-2592: -- [branch_4x commit] Yonik Seeley http://svn.apache.org/viewvc?view=revisionrevision=1433084 SOLR-2592: changes entry for doc routing Custom Hashing -- Key: SOLR-2592 URL: https://issues.apache.org/jira/browse/SOLR-2592 Project: Solr Issue Type: New Feature Components: SolrCloud Affects Versions: 4.0-ALPHA Reporter: Noble Paul Assignee: Yonik Seeley Fix For: 4.1, 5.0 Attachments: dbq_fix.patch, pluggable_sharding.patch, pluggable_sharding_V2.patch, SOLR-2592_collectionProperties.patch, SOLR-2592_collectionProperties.patch, SOLR-2592.patch, SOLR-2592_progress.patch, SOLR-2592_query_try1.patch, SOLR-2592_r1373086.patch, SOLR-2592_r1384367.patch, SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch If the data in a cloud can be partitioned on some criteria (say range, hash, attribute value, etc.), it will be easy to narrow down the search to a smaller subset of shards and, in effect, achieve more efficient search. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org