Hudson build is back to normal : Solr-3.x #105

2010-09-16 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Solr-3.x/105/changes



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2646) Implement the Military Grid Reference System for tiling

2010-09-16 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910029#action_12910029
 ] 

Lance Norskog commented on LUCENE-2646:
---

From the wikipedia page:

In the polar regions, a different convention is used.

http://earth-info.nga.mil/GandG/publications/tm8358.1/tr83581f.html

 Implement the Military Grid Reference System for tiling
 

 Key: LUCENE-2646
 URL: https://issues.apache.org/jira/browse/LUCENE-2646
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spatial
Reporter: Grant Ingersoll

 The current tile-based system in Lucene is broken.  We should standardize on 
 a common way of labeling grids and provide that as an option.  Based on 
 previous conversations with Ryan McKinley and Chris Male, it seems the 
 Military Grid Reference System 
 (http://en.wikipedia.org/wiki/Military_grid_reference_system) is a good 
 candidate for the replacement due to its standard use of metric tiles of 
 increasing orders of magnitude (1, 10, 100, 1000, etc.).
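The description's key property, metric tiles at increasing orders of magnitude, can be made concrete: dropping one digit from each half of an MGRS reference's numeric part yields the enclosing tile that is 10x coarser. A minimal sketch of that truncation (MgrsTiles and coarsen are hypothetical names, not part of any Lucene patch):

```java
// Illustrative sketch only; not from any patch.  An MGRS reference is a
// grid zone (digits + letter), a 100 km square (two letters), then 2n
// digits: n for easting, n for northing.  Truncating one digit from each
// half gives the enclosing tile one order of magnitude coarser.
final class MgrsTiles {
    /** Truncate an MGRS reference to the given precision (digit pairs, 0..5). */
    static String coarsen(String mgrs, int digitPairs) {
        int i = 0;
        while (i < mgrs.length() && Character.isDigit(mgrs.charAt(i))) {
            i++;                                 // grid zone number
        }
        int prefixEnd = i + 3;                   // zone letter + 100 km square
        String digits = mgrs.substring(prefixEnd);
        int half = digits.length() / 2;          // easting/northing halves
        if (digitPairs > half) {
            throw new IllegalArgumentException("can only coarsen, not refine");
        }
        return mgrs.substring(0, prefixEnd)
            + digits.substring(0, digitPairs)             // truncated easting
            + digits.substring(half, half + digitPairs);  // truncated northing
    }
}
```

For example, coarsen("4QFJ12345678", 2) yields "4QFJ1256", the 1 km tile containing the original 10 m reference. (Per the comment above, the polar regions use a different convention, which a real implementation would have to handle.)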

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-236) Field collapsing

2010-09-16 Thread Varun Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910058#action_12910058
 ] 

Varun Gupta commented on SOLR-236:
--

I am using the SOLR-1682 patch committed on trunk for field collapsing. It 
works great but causes problems when I include other components like Facet and 
Highlighter. Is there any workaround to use the Highlight and Facet components 
along with grouping?

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: Next

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
 field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
 quasidistributed.additional.patch, SOLR-236-1_4_1.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, 
 SOLR-236_collapsing.patch, SOLR-236_collapsing.patch


 This patch includes a new feature called Field collapsing.
 It is used to collapse a group of results with a similar value for a given 
 field to a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also "Duplicate detection".
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard

2010-09-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2647:
---

Attachment: LUCENE-2647.patch

 Move & rename the terms dict, index, abstract postings out of 
 oal.index.codecs.standard
 ---

 Key: LUCENE-2647
 URL: https://issues.apache.org/jira/browse/LUCENE-2647
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-2647.patch


 The terms dict components that currently live under the Standard codec
 (oal.index.codecs.standard.*) are in fact very generic, and in no way
 particular to the Standard codec.  Already we have many other codecs
 (sep, fixed int block, var int block, pulsing, appending) that re-use
 the terms dict writer/reader components.
 So I'd like to move these out into oal.index.codecs, and rename them:
   * StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader
   * StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader
   * SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
   * StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader
   * StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader
 With this move we have a nice reusable terms dict impl.  The terms
 index impl is still well-decoupled so eg we could [in theory] explore
 a variable gap terms index.
 Many codecs, I expect, don't need/want to implement their own terms
 dict.
 There are no code/index format changes here, besides the renaming &
 fixing all imports/usages of the renamed classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard

2010-09-16 Thread Michael McCandless (JIRA)
Move & rename the terms dict, index, abstract postings out of 
oal.index.codecs.standard
---

 Key: LUCENE-2647
 URL: https://issues.apache.org/jira/browse/LUCENE-2647
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0
 Attachments: LUCENE-2647.patch

The terms dict components that currently live under the Standard codec
(oal.index.codecs.standard.*) are in fact very generic, and in no way
particular to the Standard codec.  Already we have many other codecs
(sep, fixed int block, var int block, pulsing, appending) that re-use
the terms dict writer/reader components.

So I'd like to move these out into oal.index.codecs, and rename them:

  * StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader
  * StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader
  * SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
  * StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader
  * StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader

With this move we have a nice reusable terms dict impl.  The terms
index impl is still well-decoupled so eg we could [in theory] explore
a variable gap terms index.

Many codecs, I expect, don't need/want to implement their own terms
dict.

There are no code/index format changes here, besides the renaming &
fixing all imports/usages of the renamed classes.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-3.x #116

2010-09-16 Thread Michael McCandless
On Wed, Sep 15, 2010 at 8:43 PM, Robert Muir rcm...@gmail.com wrote:

 I wonder if now that we vary these in the tests anyway, if we should
 consider commenting out the Localized/MultiCodec runners?

 We could keep them available (but not used) in case you want to quickly run
 a test under every single Locale/Codec

+1

Mike

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2588) terms index should not store useless suffixes

2010-09-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910072#action_12910072
 ] 

Michael McCandless commented on LUCENE-2588:


After I commit the simple renaming of standard codec's terms dicts 
(LUCENE-2647), I plan to make this suffix-stripping opto private to 
StandardCodec (I think by refactoring SimpleTermsIndexWriter to add a method 
that can alter the indexed term before it's written).

Since StandardCodec hardwires the term sort to unicode order, the opto is safe 
there.

In general, if a codec uses a different term sort (such as this test's codec) 
it's conceivable a different opto could apply.  EG I think this test could 
prune suffix based on the term after the index term.  But, it makes no sense to 
spend time exploring this until a real use case arrives... this is just a 
simple test to assert that a codec is in fact free to customize the sort order.

Also, there are other fun optos we could explore w/ terms index.  EG we could 
wiggle the index term selection a bit, so it wouldn't be fixed to every N, to 
try to find terms that are small after removing the useless suffix.  
Separately, we could choose index terms according to docFreq -- eg one simple 
policy would be to plant an index term on term X if either 1) term X's docFreq 
is over a threshold, or, 2) it's been > N terms since the last indexed term.  
This could be a powerful way to even further reduce RAM usage of the terms 
index, because it'd ensure that high cost terms (ie, many docs/freqs/positions 
to visit) are in fact fast to lookup.  The low freq terms can afford a higher 
seek time since it'll be so fast to enum the docs.
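The suffix-stripping opto described above reduces to one computation: how many leading bytes of an indexed term are needed to distinguish it from the prior term. A standalone sketch (TermsIndexPrefix is a hypothetical name, not a class from the patch), assuming the binary term sort that StandardCodec hardwires:

```java
// Standalone illustration of the suffix-stripping opto; not code from the
// patch.  Assumes terms arrive in binary (unsigned byte) order, as
// StandardCodec guarantees; identical consecutive terms cannot occur in a
// terms dict.
final class TermsIndexPrefix {
    /** Length of the shortest prefix of indexTerm that still sorts strictly
     *  after priorTerm; that prefix is all the terms index needs to store. */
    static int indexedPrefixLength(byte[] priorTerm, byte[] indexTerm) {
        int limit = Math.min(priorTerm.length, indexTerm.length);
        for (int i = 0; i < limit; i++) {
            if (priorTerm[i] != indexTerm[i]) {
                return i + 1;  // first differing byte decides the sort
            }
        }
        // indexTerm extends priorTerm; one extra byte makes it sort after
        return limit + 1;
    }
}
```

For the issue description's example this returns 2 for priorTerm "aa" and indexTerm "abcd123456789": only "ab" need be stored, and the 11-byte suffix never reaches the in-RAM index.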

 terms index should not store useless suffixes
 -

 Key: LUCENE-2588
 URL: https://issues.apache.org/jira/browse/LUCENE-2588
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2588.patch, LUCENE-2588.patch


 This idea came up when discussing w/ Robert how to improve our terms index...
 The terms dict index today simply grabs whatever term was at a 0 mod 128 
 index (by default).
 But this is wasteful because you often don't need the suffix of the term at 
 that point.
 EG if the 127th term is "aa" and the 128th (indexed) term is "abcd123456789", 
 instead of storing that full term you only need to store "ab".  The suffix is 
 useless, and uses up RAM since we load the terms index into RAM.
 The patch is very simple.  The optimization is particularly easy because 
 terms are now byte[] and we sort in binary order.
 I tested on first 10M 1KB Wikipedia docs, and this reduces the terms index 
 (tii) file from 3.9 MB -> 3.3 MB = 16% smaller (using StandardAnalyzer, 
 indexing body field tokenized but title / date fields untokenized).  I expect 
 on noisier terms dicts, especially ones w/ bad terms accidentally indexed, 
 that the savings will be even more.
 In the future we could do crazier things.  EG there's no real reason why the 
 indexed terms must be regular (every N terms), so, we could instead pick 
 terms more carefully, say approximately every N, but favor terms that have 
 a smaller net prefix.  We can also index more sparsely in regions where the 
 net docFreq is lowish, since we can afford somewhat higher seek+scan time to 
 these terms since enuming their docs will be much faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-3.x #116

2010-09-16 Thread Simon Willnauer
On Thu, Sep 16, 2010 at 11:56 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Wed, Sep 15, 2010 at 8:43 PM, Robert Muir rcm...@gmail.com wrote:

 I wonder if now that we vary these in the tests anyway, if we should
 consider commenting out the Localized/MultiCodec runners?

 We could keep them available (but not used) in case you want to quickly run
 a test under every single Locale/Codec

 +1
Yep  +1

 Mike

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard

2010-09-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910081#action_12910081
 ] 

Simon Willnauer commented on LUCENE-2647:
-

Mike, I think renaming is a good idea - that might make things slightly easier 
for folks to play around with codecs.

here are some comments on the naming:

bq. StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader
+1

bq. StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader
What about TermsIndexWriter/ReaderBase, since we started using that scheme with 
analyzers and the JDK uses it too? If we remove the abstractness one day the 
name is very misleading, but the property of being a base class will likely 
remain.

bq. SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
I really don't like Simple*; it's like Smart, which makes me immediately feel 
itchy all over the place. What differentiates this one from the others? Is it 
the default? Maybe DefaultTermsIndexWriter/Reader? 

bq. StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader
Again, what about PostingsWriter/ReaderBase?

bq. StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader
+1

 Move & rename the terms dict, index, abstract postings out of 
 oal.index.codecs.standard
 ---

 Key: LUCENE-2647
 URL: https://issues.apache.org/jira/browse/LUCENE-2647
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-2647.patch


 The terms dict components that currently live under the Standard codec
 (oal.index.codecs.standard.*) are in fact very generic, and in no way
 particular to the Standard codec.  Already we have many other codecs
 (sep, fixed int block, var int block, pulsing, appending) that re-use
 the terms dict writer/reader components.
 So I'd like to move these out into oal.index.codecs, and rename them:
   * StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader
   * StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader
   * SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
   * StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader
   * StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader
 With this move we have a nice reusable terms dict impl.  The terms
 index impl is still well-decoupled so eg we could [in theory] explore
 a variable gap terms index.
 Many codecs, I expect, don't need/want to implement their own terms
 dict.
 There are no code/index format changes here, besides the renaming &
 fixing all imports/usages of the renamed classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2588) terms index should not store useless suffixes

2010-09-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910084#action_12910084
 ] 

Simon Willnauer commented on LUCENE-2588:
-

{quote}
After I commit the simple renaming of standard codec's terms dicts 
(LUCENE-2647), I plan to make this suffix-stripping opto private to 
StandardCodec (I think by refactoring SimpleTermsIndexWriter to add a method 
that can alter the indexed term before it's written).
{quote}
Mike what about factoring out a method like 
{code}
protected short indexTermPrefixLen(BytesRef lastTerm, BytesRef currentTerm){
  ...
}

{code}

then we can simply override that method if there is a comparator which cannot 
utilize / breaks this opto?
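Fleshing that proposal out, a hedged sketch of what such an overridable hook might look like (all class names below are hypothetical stand-ins, not actual Lucene API, and byte[] replaces BytesRef to keep the sketch self-contained):

```java
// Hedged sketch of the proposed hook; class names are hypothetical
// stand-ins and byte[] replaces BytesRef for self-containment.
class TermsIndexWriterSketch {
    /** Default opto: store only the shortest prefix of currentTerm that
     *  sorts strictly after lastTerm in binary order. */
    protected int indexTermPrefixLen(byte[] lastTerm, byte[] currentTerm) {
        int limit = Math.min(lastTerm.length, currentTerm.length);
        for (int i = 0; i < limit; i++) {
            if (lastTerm[i] != currentTerm[i]) {
                return i + 1;  // first differing byte decides the sort
            }
        }
        // currentTerm extends lastTerm; one extra byte distinguishes them
        return Math.min(limit + 1, currentTerm.length);
    }
}

/** A codec whose term comparator breaks the opto simply disables it. */
class CustomSortTermsIndexWriter extends TermsIndexWriterSketch {
    @Override
    protected int indexTermPrefixLen(byte[] lastTerm, byte[] currentTerm) {
        return currentTerm.length;  // keep the full term in the index
    }
}
```

The design point is that the base class stays safe for the default binary sort, while a codec with a custom sort overrides one method rather than reimplementing the index writer.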

 terms index should not store useless suffixes
 -

 Key: LUCENE-2588
 URL: https://issues.apache.org/jira/browse/LUCENE-2588
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2588.patch, LUCENE-2588.patch


 This idea came up when discussing w/ Robert how to improve our terms index...
 The terms dict index today simply grabs whatever term was at a 0 mod 128 
 index (by default).
 But this is wasteful because you often don't need the suffix of the term at 
 that point.
 EG if the 127th term is "aa" and the 128th (indexed) term is "abcd123456789", 
 instead of storing that full term you only need to store "ab".  The suffix is 
 useless, and uses up RAM since we load the terms index into RAM.
 The patch is very simple.  The optimization is particularly easy because 
 terms are now byte[] and we sort in binary order.
 I tested on first 10M 1KB Wikipedia docs, and this reduces the terms index 
 (tii) file from 3.9 MB -> 3.3 MB = 16% smaller (using StandardAnalyzer, 
 indexing body field tokenized but title / date fields untokenized).  I expect 
 on noisier terms dicts, especially ones w/ bad terms accidentally indexed, 
 that the savings will be even more.
 In the future we could do crazier things.  EG there's no real reason why the 
 indexed terms must be regular (every N terms), so, we could instead pick 
 terms more carefully, say approximately every N, but favor terms that have 
 a smaller net prefix.  We can also index more sparsely in regions where the 
 net docFreq is lowish, since we can afford somewhat higher seek+scan time to 
 these terms since enuming their docs will be much faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard

2010-09-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910089#action_12910089
 ] 

Michael McCandless commented on LUCENE-2647:


bq. What about TermsIndexWriter/ReaderBase since we started using that scheme 
with analyzers and the JDK uses that too.

OK, I'll switch from Abstract* -> *Base.

{quote}
bq. SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader

I really don't like Simple*; it's like Smart, which makes me immediately feel 
itchy all over the place. 
{quote}

Heh OK.

bq. What differentiates this one from the others? Is it the default? Maybe 
DefaultTermsIndexWriter/Reader?

Well... there are no others yet!  So, it is the default for now, but I 
don't like baking that into its name...

Let's see... so this one uses packed ints to write the RAM image required at 
search time, so that at search time we just slurp in these pre-built images.  
While the index term selection policy is now fixed (every N), I think this 
may change with time (the policy should be easily separable from how the index 
terms are written).  Though, since we haven't yet done that separation, maybe I 
simply name it FixedGapTermsIndexWriter/Reader?  How's that?

 Move & rename the terms dict, index, abstract postings out of 
 oal.index.codecs.standard
 ---

 Key: LUCENE-2647
 URL: https://issues.apache.org/jira/browse/LUCENE-2647
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-2647.patch


 The terms dict components that currently live under the Standard codec
 (oal.index.codecs.standard.*) are in fact very generic, and in no way
 particular to the Standard codec.  Already we have many other codecs
 (sep, fixed int block, var int block, pulsing, appending) that re-use
 the terms dict writer/reader components.
 So I'd like to move these out into oal.index.codecs, and rename them:
   * StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader
   * StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader
   * SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
   * StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader
   * StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader
 With this move we have a nice reusable terms dict impl.  The terms
 index impl is still well-decoupled so eg we could [in theory] explore
 a variable gap terms index.
 Many codecs, I expect, don't need/want to implement their own terms
 dict.
 There are no code/index format changes here, besides the renaming &
 fixing all imports/usages of the renamed classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2647) Move & rename the terms dict, index, abstract postings out of oal.index.codecs.standard

2010-09-16 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910095#action_12910095
 ] 

Simon Willnauer commented on LUCENE-2647:
-

bq. ...FixedGapTermsIndexWriter/Reader? How's that?
+1



 Move & rename the terms dict, index, abstract postings out of 
 oal.index.codecs.standard
 ---

 Key: LUCENE-2647
 URL: https://issues.apache.org/jira/browse/LUCENE-2647
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-2647.patch


 The terms dict components that currently live under the Standard codec
 (oal.index.codecs.standard.*) are in fact very generic, and in no way
 particular to the Standard codec.  Already we have many other codecs
 (sep, fixed int block, var int block, pulsing, appending) that re-use
 the terms dict writer/reader components.
 So I'd like to move these out into oal.index.codecs, and rename them:
   * StandardTermsDictWriter/Reader -> PrefixCodedTermsWriter/Reader
   * StandardTermsIndexWriter/Reader -> AbstractTermsIndexWriter/Reader
   * SimpleStandardTermsIndexWriter/Reader -> SimpleTermsIndexWriter/Reader
   * StandardPostingsWriter/Reader -> AbstractPostingsWriter/Reader
   * StandardPostingsWriterImpl/ReaderImpl -> StandardPostingsWriter/Reader
 With this move we have a nice reusable terms dict impl.  The terms
 index impl is still well-decoupled so eg we could [in theory] explore
 a variable gap terms index.
 Many codecs, I expect, don't need/want to implement their own terms
 dict.
 There are no code/index format changes here, besides the renaming &
 fixing all imports/usages of the renamed classes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-792) Tree Faceting Component

2010-09-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910116#action_12910116
 ] 

Yonik Seeley commented on SOLR-792:
---

1.4.x is for bugfixes only.

 Tree Faceting Component
 ---

 Key: SOLR-792
 URL: https://issues.apache.org/jira/browse/SOLR-792
 Project: Solr
  Issue Type: New Feature
Reporter: Erik Hatcher
Assignee: Ryan McKinley
Priority: Minor
 Attachments: SOLR-792-PivotFaceting.patch, 
 SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, 
 SOLR-792-PivotFaceting.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, 
 SOLR-792.patch, SOLR-792.patch, SOLR-792.patch


 A component to do multi-level faceting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910123#action_12910123
 ] 

Michael McCandless commented on LUCENE-2575:


bq. I'm not immediately sure what's reading the level at this end position of 
the byte[].

This is so that once we exhaust the slice and must allocate the next one we 
know what size (level + 1, ceiling'd) to make the next slice.
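Concretely, that end byte encodes the slice's level, and the next slice's size follows from a pair of level tables. The sketch below mirrors the tables in Lucene's ByteBlockPool as of this era, but SliceLevels is a hypothetical standalone class and the exact constants should be treated as illustrative:

```java
// Hedged illustration of interleaved-slice sizing; SliceLevels is a
// hypothetical class, not the real pool.  The tables mirror those in
// Lucene's ByteBlockPool, but treat the exact constants as illustrative.
final class SliceLevels {
    // level -> level of the slice allocated after it ("level + 1, ceiling'd")
    static final int[] NEXT_LEVEL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
    // level -> slice size in bytes; slices grow so long postings stay cheap
    static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    /** The byte written at the end of a full slice carries that slice's
     *  level in its low 4 bits; reading it back tells us how big to make
     *  the next slice once this one is exhausted. */
    static int nextSliceSize(int endByte) {
        int level = endByte & 15;                 // low 4 bits hold the level
        return LEVEL_SIZE[NEXT_LEVEL[level]];
    }
}
```

So a freshly started chain (level 0, 5-byte slice) grows to a 14-byte slice next, while a chain already at the top level keeps allocating 200-byte slices.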

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Build failed in Hudson: Lucene-3.x #116

2010-09-16 Thread Robert Muir
Ok, I will create an issue.

For starters we could comment out these runners (for example, if code does
not work for a locale, it will fail 'eventually' due to the fact we pick a
random one anyway).

In the future maybe we could make this functionality easily triggerable with
a -D, in case you want to run a test class/method/entire test suite under
all Locales/Codecs

On Thu, Sep 16, 2010 at 6:23 AM, Simon Willnauer 
simon.willna...@googlemail.com wrote:

 On Thu, Sep 16, 2010 at 11:56 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
  On Wed, Sep 15, 2010 at 8:43 PM, Robert Muir rcm...@gmail.com wrote:
 
  I wonder if now that we vary these in the tests anyway, if we should
  consider commenting out the Localized/MultiCodec runners?
 
  We could keep them available (but not used) in case you want to quickly
 run
  a test under every single Locale/Codec
 
  +1
 Yep  +1
 
  Mike
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (SOLR-236) Field collapsing

2010-09-16 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910131#action_12910131
 ] 

Yonik Seeley commented on SOLR-236:
---

bq. It works great but causes problems when I include other components like 
Facet and Highlighter.

See the list of sub-tasks on this issue starting with "SearchGrouping:".
I fixed faceting yesterday - and I hope to fix highlighting and debugging today.

 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Assignee: Shalin Shekhar Mangar
 Fix For: Next

 Attachments: collapsing-patch-to-1.3.0-dieter.patch, 
 collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, 
 collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, 
 field-collapse-3.patch, field-collapse-4-with-solrj.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, 
 field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, 
 field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, 
 field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, 
 quasidistributed.additional.patch, SOLR-236-1_4_1.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, 
 SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, 
 SOLR-236_collapsing.patch, SOLR-236_collapsing.patch


 This patch includes a new feature called Field collapsing.
 It is used to collapse a group of results with a similar value for a given 
 field to a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also "Duplicate detection".
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2588) terms index should not store useless suffixes

2010-09-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910140#action_12910140
 ] 

Robert Muir commented on LUCENE-2588:
-

{quote}
Also, there are other fun optimizations we could explore w/ the terms index. EG we could 
wiggle the index term selection a bit, so it wouldn't be fixed to every N, to 
try to find terms that are small after removing the useless suffix. Separately, 
we could choose index terms according to docFreq - eg one simple policy would 
be to plant an index term on term X if either 1) term X's docFreq is over a 
threshold, or, 2) it's been > N terms since the last indexed term. This could 
be a powerful way to even further reduce RAM usage of the terms index, because 
it'd ensure that high cost terms (ie, many docs/freqs/positions to visit) are 
in fact fast to lookup. The low freq terms can afford a higher seek time since 
it'll be so fast to enum the docs.
{quote}

It would be great to come up with a heuristic that balances all 3 of these: 
because selecting every 32nd term is silly if it would give you "abracadabra" when 
the previous term is "a" and a little fudging would give you a smaller index term 
(of course it depends, too, on what the next index term would be, and on the 
docFreq optimization).

It sounds tricky, but right now we are just selecting index terms with no basis 
at all (essentially random). Then we are trying to deal with bad selections by 
trimming wasted suffixes, etc.
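The suffix trimming itself is easy to sketch: for each indexed term, you only need the shortest prefix that still sorts strictly after the previous term. A minimal illustration (the class and method names are invented for this sketch; this is not the actual Lucene code):

```java
// Sketch: keep only the shortest prefix of `term` that still sorts
// strictly after the previous term `prev` (lexicographic order).
// Illustrative only -- not the actual Lucene implementation.
public class IndexTermPrefix {
    static String indexedForm(String prev, String term) {
        for (int i = 0; i < term.length(); i++) {
            // first position where `term` diverges from, or extends past, `prev`
            if (i >= prev.length() || term.charAt(i) != prev.charAt(i)) {
                return term.substring(0, i + 1);
            }
        }
        return term; // unreachable for sorted, unique terms
    }

    public static void main(String[] args) {
        // the example from the issue: after "aa", "abcd123456789" can be stored as "ab"
        System.out.println(IndexTermPrefix.indexedForm("aa", "abcd123456789"));
    }
}
```

Any seek target that is >= the trimmed entry and < the next index term lands in the same block as it would with the full term, so the shortened entry behaves identically for lookups.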


 terms index should not store useless suffixes
 -

 Key: LUCENE-2588
 URL: https://issues.apache.org/jira/browse/LUCENE-2588
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2588.patch, LUCENE-2588.patch


 This idea came up when discussing w/ Robert how to improve our terms index...
 The terms dict index today simply grabs whatever term was at a 0 mod 128 
 index (by default).
 But this is wasteful because you often don't need the suffix of the term at 
 that point.
 EG if the 127th term is "aa" and the 128th (indexed) term is "abcd123456789", 
 instead of storing that full term you only need to store "ab".  The suffix is 
 useless, and uses up RAM since we load the terms index into RAM.
 The patch is very simple.  The optimization is particularly easy because 
 terms are now byte[] and we sort in binary order.
 I tested on the first 10M 1KB Wikipedia docs, and this reduces the terms index 
 (tii) file from 3.9 MB to 3.3 MB = 16% smaller (using StandardAnalyzer, 
 indexing the body field tokenized but the title / date fields untokenized).  I expect 
 on noisier terms dicts, especially ones w/ bad terms accidentally indexed, 
 that the savings will be even more.
 In the future we could do crazier things.  EG there's no real reason why the 
 indexed terms must be regular (every N terms), so, we could instead pick 
 terms more carefully, say approximately every N, but favor terms that have 
 a smaller net prefix.  We can also index more sparsely in regions where the 
 net docFreq is lowish, since we can afford somewhat higher seek+scan time to 
 these terms since enuming their docs will be much faster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2646) Iimplement the Military Grid Reference System for tiling

2010-09-16 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910157#action_12910157
 ] 

David Smiley commented on LUCENE-2646:
--

I hope you guys can make it to my session at LuceneRevolution at which I'll 
describe my geohash prefix filtering technique.  I'm working on an open-source 
contribution but the public release process is slow at MITRE.  I'm not yet 
employing a tiling technique but it's where I want to go.
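For intuition, the metric-tiling idea behind MGRS-style grids is just nested truncation of projected coordinates to power-of-ten cell sizes. The following toy sketch is not real MGRS (it omits grid zones, latitude bands, and 100km square letters entirely); it only illustrates hierarchical 1m/10m/100m/1000m tiles:

```java
// Toy illustration of hierarchical metric tiles: label a projected point by
// truncating easting/northing (in meters) to the cell size. NOT real MGRS --
// zones, latitude bands, and 100km square letters are omitted.
public class MetricTiles {
    static String tile(int eastingMeters, int northingMeters, int cellSize) {
        int e = (eastingMeters / cellSize) * cellSize; // snap to cell origin
        int n = (northingMeters / cellSize) * cellSize;
        return cellSize + "m:" + e + "E/" + n + "N";
    }

    public static void main(String[] args) {
        int e = 123456, n = 654321;
        // each coarser label is a truncation of the finer one
        for (int size : new int[] {1000, 100, 10, 1}) {
            System.out.println(tile(e, n, size));
        }
    }
}
```

Indexing a point at several of these resolutions is what lets a spatial filter drill from coarse cells down to fine ones, which is the property that makes MGRS-style grids attractive for tiling.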

 Iimplement the Military Grid Reference System for tiling
 

 Key: LUCENE-2646
 URL: https://issues.apache.org/jira/browse/LUCENE-2646
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/spatial
Reporter: Grant Ingersoll

 The current tile based system in Lucene is broken.  We should standardize on 
 a common way of labeling grids and provide that as an option.  Based on 
 previous conversations with Ryan McKinley and Chris Male, it seems the 
 Military Grid Reference System 
 (http://en.wikipedia.org/wiki/Military_grid_reference_system) is a good 
 candidate for the replacement due to its standard use of metric tiles of 
 increasing orders of magnitude (1, 10, 100, 1000, etc.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

2010-09-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910201#action_12910201
 ] 

Jason Rutherglen commented on LUCENE-2575:
--

bq. we know what size (level + 1, ceiling'd) to make the next slice.

Thanks.  In the midst of debugging last night I realized this.  The next 
question is whether to remove it.

 Concurrent byte and int block implementations
 -

 Key: LUCENE-2575
 URL: https://issues.apache.org/jira/browse/LUCENE-2575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
 Fix For: Realtime Branch

 Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, 
 LUCENE-2575.patch


 The current *BlockPool implementations aren't quite concurrent.
 We really need something that has a locking flush method, where
 flush is called at the end of adding a document. Once flushed,
 the newly written data would be available to all other reading
 threads (ie, postings etc). I'm not sure I understand the slices
 concept, it seems like it'd be easier to implement a seekable
 random access file like API. One'd seek to a given position,
 then read or write from there. The underlying management of byte
 arrays could then be hidden?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Created: (LUCENE-2648) Allow PackedInts.ReaderIterator to advance more than one value

2010-09-16 Thread Simon Willnauer (JIRA)
Allow PackedInts.ReaderIterator to advance more than one value
--

 Key: LUCENE-2648
 URL: https://issues.apache.org/jira/browse/LUCENE-2648
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor


The iterator-like API in LUCENE-2186 makes effective use of 
PackedInts.ReaderIterator but frequently skips multiple values. ReaderIterator 
currently requires looping over ReaderIterator#next() to advance to a certain 
value. We should allow ReaderIterator to expose an #advance(ord) method to make 
use-cases like that more efficient. 

This issue is somewhat part of my efforts to make LUCENE-2186 smaller while 
breaking it up in little issues for parts which can be generally useful.
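A rough sketch of the difference between looping next() and a direct advance(ord), using a plain array in place of a packed-ints backing store (SimpleReaderIterator and its methods are invented names for illustration, not the actual PackedInts API):

```java
// Sketch of the proposed #advance(ord): jump straight to an ordinal instead
// of looping #next(). Backed by a plain long[] here for clarity; a real
// packed-ints reader would seek to ord * bitsPerValue bits instead.
public class AdvanceSketch {
    static class SimpleReaderIterator {
        private final long[] values;
        private int ord = -1;

        SimpleReaderIterator(long[] values) { this.values = values; }

        long next() { return values[++ord]; }   // skipping k values costs k calls

        long advance(int target) {              // direct jump, no intermediate reads
            ord = target;
            return values[ord];
        }
    }

    public static void main(String[] args) {
        SimpleReaderIterator it = new SimpleReaderIterator(new long[] {7, 11, 13, 17, 19});
        it.next();                          // ord 0
        System.out.println(it.advance(3));  // jump straight to ord 3
    }
}
```

With a bit-packed backing store the jump is still O(1): the byte and bit offset of ordinal `ord` are computable directly from `bitsPerValue`, which is exactly why an advance(ord) API pays off over repeated next().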

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-2648) Allow PackedInts.ReaderIterator to advance more than one value

2010-09-16 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2648:


Attachment: LUCENE-2648.patch

here is a patch - comments welcome

 Allow PackedInts.ReaderIterator to advance more than one value
 --

 Key: LUCENE-2648
 URL: https://issues.apache.org/jira/browse/LUCENE-2648
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
Priority: Minor
 Attachments: LUCENE-2648.patch


 The iterator-like API in LUCENE-2186 makes effective use of 
 PackedInts.ReaderIterator but frequently skips multiple values. 
 ReaderIterator currently requires looping over ReaderIterator#next() to 
 advance to a certain value. We should allow ReaderIterator to expose an 
 #advance(ord) method to make use-cases like that more efficient. 
 This issue is somewhat part of my efforts to make LUCENE-2186 smaller while 
 breaking it up in little issues for parts which can be generally useful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: trie* fields and sortMissingLast?

2010-09-16 Thread Yonik Seeley
On Thu, Sep 16, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote:
 (i changed the subject to see if Uwe perks up)

 Is it possible to change the FieldCache for Trie* fields so that it
 knows what fields are missing?  or is there something about the Trie
 structure that makes that impossible.

Nope - it is trivial to record that while the entry is being built for
all of the current FieldCache entry types - it's just not currently
done.  After it is recorded (via a bitset most likely), it needs to be
exposed via an API.

 It would be great to be able to deprecate sint,slong,sfloat,sdouble

+1

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910262#action_12910262
 ] 

Michael McCandless commented on LUCENE-2324:


Is this near-committable?  Ie just the DWPT cutover?  This part seems 
separable from making each DWPT's buffer searchable?

I'm running some tests w/ 20 indexing threads and I think the sync'd flush is a 
big bottleneck...

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910268#action_12910268
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

bq. I think the sync'd flush is a big bottleneck

Is this because indexing stops while the DWPT segment is being flushed to disk 
or are you referring to a different sync?

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-09-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910276#action_12910276
 ] 

Michael McCandless commented on LUCENE-2324:


bq. Is this because indexing stops while the DWPT segment is being flushed to 
disk or are you referring to a different sync?

I'm talking about Lucene trunk today (ie before this patch).

Yes, because indexing of all 20 threads is blocked while a single thread moves 
the RAM buffer to disk.  But, with this patch, each thread will privately move 
its own RAM buffer to disk, not blocking the rest.

With 20 threads I'm seeing ~4 seconds of concurrent indexing and then 6-8 
seconds to flush (w/ 256 MB RAM buffer).

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: trie* fields and sortMissingLast?

2010-09-16 Thread Ryan McKinley
On Thu, Sep 16, 2010 at 11:28 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Thu, Sep 16, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote:
 (i changed the subject to see if Uwe perks up)

 Is it possible to change the FieldCache for Trie* fields so that it
 knows what fields are missing?  or is there something about the Trie
 structure that makes that impossible.

 Nope - it is trivial to record that while the entry is being built for
 all of the current FieldCache entry types - it's just not currently
 done.  After it is recorded (via a bitset most likely), it needs to be
 exposed via an API.


Looking at the FieldCache (first time ever), I'm not sure I see an
obvious place to augment the cache with a BitSet for the matching
docs.

We could add a function to the FieldCache like:

  public BitSet getMatchingDocs(IndexReader reader, String field )

That would cache the matching docs for a field, however that means you
would have to traverse the terms twice.  The existing API for caching
values stores the values (short[], int[], etc) not the Entry, so
augmenting the cached Entry with a BitSet would get lost.

It seems that this could be done, but would require some rejiggering
to the API.  The API could return an object like:
class ByteValues {
  byte[] values;
  BitSet valid;
}

public ByteValues  getBytes (IndexReader reader, String field)

Another option (just brainstorming) would be to set the arrays to a
special value to say they are 'missing' - for example
Integer.MIN_VALUE.  The downside of this is that we lose one valid
value in the range.  For int, double, float, this may be OK, but for
byte and short this is a pretty big tradeoff.

Ideas for what may be a good path forward?
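The ByteValues shape sketched in the email above can be tried out with a plain java.util.BitSet standing in for Lucene's bitset types (the names mirror the brainstorm, not any committed API):

```java
// Minimal stand-in for the ByteValues idea: the cached values array plus a
// validity bitset filled in the same pass, so "doc has no value" is
// distinguishable from a genuine value of 0.
import java.util.BitSet;

public class ByteValuesSketch {
    static final class ByteValues {
        final byte[] values;
        final BitSet valid; // one bit per doc that actually had a value

        ByteValues(int maxDoc) {
            values = new byte[maxDoc];
            valid = new BitSet(maxDoc);
        }

        void set(int doc, byte value) { // called while traversing the terms once
            values[doc] = value;
            valid.set(doc);
        }
    }

    public static void main(String[] args) {
        ByteValues cache = new ByteValues(4);
        cache.set(0, (byte) 42);
        cache.set(2, (byte) 0);                  // a genuine value of 0
        System.out.println(cache.valid.get(2));  // doc 2 has a value
        System.out.println(cache.valid.get(3));  // doc 3 is missing
    }
}
```

This sidesteps the sentinel-value tradeoff entirely: no value in the byte range has to be sacrificed to mean "missing".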




[jira] Commented: (SOLR-1852) enablePositionIncrements=true can cause searches to fail when they are parsed as phrase queries

2010-09-16 Thread Mark Bennett (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910287#action_12910287
 ] 

Mark Bennett commented on SOLR-1852:


I realize this is closed, but I found a workaround for those who are still 
working with a pre-fix version.

Just put the stopwords filter after the Word Delimiter filter. That worked for 
us without impacting much else, until we can get over to the new version.
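For reference, the workaround amounts to reordering the filter chain in schema.xml. A hypothetical Solr 1.4-era field type illustrating the order (the tokenizer choice and attribute values are examples, not taken from the issue):

```xml
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split/catenate first, so "Identi.ca" becomes its word parts ... -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <!-- ... then remove stopwords, after the word delimiter filter -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this order, the split-up tokens are produced before the stop filter introduces its position holes, which avoids the phrase-query mismatch described above.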

 enablePositionIncrements=true can cause searches to fail when they are 
 parsed as phrase queries
 -

 Key: SOLR-1852
 URL: https://issues.apache.org/jira/browse/SOLR-1852
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Peter Wolanin
Assignee: Robert Muir
 Fix For: 1.4.1

 Attachments: SOLR-1852.patch, SOLR-1852_solr14branch.patch, 
 SOLR-1852_testcase.patch


 Symptom: searching for a string like a domain name containing a '.', the Solr 
 1.4 analyzer tells me that I will get a match, but when I enter the search 
 either in the client or directly in Solr, the search fails. 
 test string:  Identi.ca
 queries that fail:  IdentiCa, Identi.ca, Identi-ca
 query that matches: Identi ca
 schema in use is:
 http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1
 Screen shots:
 analysis:  http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
 dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
 dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
 standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png
 Whether or not the bug appears is determined by the surrounding text:
 "would be great to have support for Identi.ca on the follow block"
 fails to match Identi.ca, but putting the content on its own or in another 
 sentence:
 "Support Identi.ca"
 the search matches.  Testing suggests the word "for" is the problem, and it 
 looks like the bug occurs when a stop word precedes a word that is split up 
 using the word delimiter filter.
 Setting enablePositionIncrements=false in the stop filter and reindexing 
 causes the searches to match.
 According to Mark Miller in #solr, this bug appears to be fixed already in 
 Solr trunk, either due to the upgraded lucene or changes to the 
 WordDelimiterFactory

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Resolved: (SOLR-2064) Search Grouping: support highlighting

2010-09-16 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved SOLR-2064.


Fix Version/s: 4.0
   Resolution: Fixed

fix committed.

 Search Grouping: support highlighting
 -

 Key: SOLR-2064
 URL: https://issues.apache.org/jira/browse/SOLR-2064
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley
 Fix For: 4.0


 Highlighting should be supported regardless of where the documents occur in a 
 response, and regardless of the format (grouped, standard, etc).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





RE: trie* fields and sortMissingLast?

2010-09-16 Thread Uwe Schindler
Hi,

Is there already an issue open for the Bits interface in parallel to the
native type arrays in FieldCache?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
 Seeley
 Sent: Thursday, September 16, 2010 11:28 AM
 To: dev@lucene.apache.org
 Subject: Re: trie* fields and sortMissingLast?
 
 On Thu, Sep 16, 2010 at 2:20 PM, Ryan McKinley ryan...@gmail.com wrote:
  (i changed the subject to see if Uwe perks up)
 
  Is it possible to change the FieldCache for Trie* fields so that it
  knows what fields are missing?  or is there something about the Trie
  structure that makes that impossible.
 
 Nope - it is trivial to record that while the entry is being built for all of the
 current FieldCache entry types - it's just not currently done.  After it is recorded
 (via a bitset most likely), it needs to be exposed via an API.
 
  It would be great to be able to deprecate sint,slong,sfloat,sdouble
 
 +1
 
 -Yonik
 http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
 






RE: trie* fields and sortMissingLast?

2010-09-16 Thread Uwe Schindler
 It seems that this could be done, but would require some rejiggering to the API.
 The API could return an object like:
 class ByteValues {
   byte[] values;
   BitSet valid;
 }
 
 public ByteValues  getBytes (IndexReader reader, String field)

That's the plan for how to do it. Only replace BitSet with the Bits interface
(which is available in trunk). Bits is also implemented by OpenBitSet, so
the cache can be backed by OpenBitSet. You only have to consult the terms one
time: start with empty Bits and set a mark on each document that has got
a value assigned.





[jira] Created: (LUCENE-2649) FieldCache should include a BitSet for matching docs

2010-09-16 Thread Ryan McKinley (JIRA)
FieldCache should include a BitSet for matching docs


 Key: LUCENE-2649
 URL: https://issues.apache.org/jira/browse/LUCENE-2649
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Ryan McKinley
 Fix For: 4.0


The FieldCache returns an array representing the values for each doc.  However 
there is no way to know if the doc actually has a value.

This should be changed to return an object representing the values *and* a 
BitSet for all valid docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2649) FieldCache should include a BitSet for matching docs

2010-09-16 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910446#action_12910446
 ] 

Ryan McKinley commented on LUCENE-2649:
---

See some discussion here:
http://search.lucidimagination.com/search/document/b6a531f7b73621f1/trie_fields_and_sortmissinglast

 FieldCache should include a BitSet for matching docs
 

 Key: LUCENE-2649
 URL: https://issues.apache.org/jira/browse/LUCENE-2649
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Ryan McKinley
 Fix For: 4.0


 The FieldCache returns an array representing the values for each doc.  
 However there is no way to know if the doc actually has a value.
 This should be changed to return an object representing the values *and* a 
 BitSet for all valid docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-2649) FieldCache should include a BitSet for matching docs

2010-09-16 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated LUCENE-2649:
--

Attachment: LUCENE-2649-FieldCacheWithBitSet.patch

This patch replaces the cached primitive[] with a CachedObject.  The object 
hierarchy looks like this:

{code:java}
public abstract static class CachedObject {
}

public abstract static class CachedArray extends CachedObject {
  public final Bits valid;
  public CachedArray( Bits valid ) {
    this.valid = valid;
  }
}

public static final class ByteValues extends CachedArray {
  public final byte[] values;
  public ByteValues( byte[] values, Bits valid ) {
    super( valid );
    this.values = values;
  }
}
...
{code}

Then this deprecates the getBytes() methods and replaces them with 
getByteValues():

{code:java}
public ByteValues getByteValues(IndexReader reader, String field)
    throws IOException;

public ByteValues getByteValues(IndexReader reader, String field, ByteParser parser)
    throws IOException;
{code}

Then repeat for all the other types!

All tests pass with this patch, but I have not added any tests for the BitSet 
(yet).

If people like the general look of this approach, I will clean it up and add 
some tests, javadoc cleanup, etc.


 FieldCache should include a BitSet for matching docs
 

 Key: LUCENE-2649
 URL: https://issues.apache.org/jira/browse/LUCENE-2649
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Ryan McKinley
 Fix For: 4.0

 Attachments: LUCENE-2649-FieldCacheWithBitSet.patch


 The FieldCache returns an array representing the values for each doc.  
 However there is no way to know if the doc actually has a value.
 This should be changed to return an object representing the values *and* a 
 BitSet for all valid docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: trie* fields and sortMissingLast?

2010-09-16 Thread Ryan McKinley
I could not find anything similar in JIRA, and went ahead and implemented:
https://issues.apache.org/jira/browse/LUCENE-2649


On Thu, Sep 16, 2010 at 5:21 PM, Uwe Schindler u...@thetaphi.de wrote:
 It seems that this could be done, but would require some rejiggering to
 the API.
 The API could return an object like:
 class ByteValues {
   byte[] values;
   BitSet valid;
 }

 public ByteValues  getBytes (IndexReader reader, String field)

 That's the plan for how to do it. Only replace BitSet with the Bits interface
 (which is available in trunk). Bits is also implemented by OpenBitSet, so
 the cache can be backed by OpenBitSet. You only have to consult the terms one
 time: start with empty Bits and set a mark on each document that has got
 a value assigned.





[jira] Updated: (LUCENE-2649) FieldCache should include a BitSet for matching docs

2010-09-16 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated LUCENE-2649:
--

Attachment: LUCENE-2649-FieldCacheWithBitSet.patch

A slightly simplified version

 FieldCache should include a BitSet for matching docs
 

 Key: LUCENE-2649
 URL: https://issues.apache.org/jira/browse/LUCENE-2649
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Ryan McKinley
 Fix For: 4.0

 Attachments: LUCENE-2649-FieldCacheWithBitSet.patch, 
 LUCENE-2649-FieldCacheWithBitSet.patch


 The FieldCache returns an array representing the values for each doc.  
 However there is no way to know if the doc actually has a value.
 This should be changed to return an object representing the values *and* a 
 BitSet for all valid docs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-2649) FieldCache should include a BitSet for matching docs

2010-09-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910461#action_12910461
 ] 

Uwe Schindler commented on LUCENE-2649:
---

That looks exactly like what I proposed!

The only thing: for DocTerms the approach is not needed - you can check for 
null, so the Bits interface is not needed there. As the OpenBitSets are created with 
the exact size and don't need to grow, you can use fastSet to speed up creation 
by doing no bounds checks.




[jira] Commented: (LUCENE-2649) FieldCache should include a BitSet for matching docs

2010-09-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910464#action_12910464
 ] 

Uwe Schindler commented on LUCENE-2649:
---

When this is committed, we can also improve some other Lucene parts:
FieldCacheRangeFilter no longer needs to do extra deletion checks and can
instead use the Bits interface to find missing/non-valued documents. Lucene's
sorting Collectors can be improved to handle missing values consistently (like
Solr's sortMissingFirst/sortMissingLast).
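A rough sketch of what consistent missing-value ordering could look like, in the spirit of Solr's sortMissingFirst/sortMissingLast — all names here are illustrative, and this is a plain comparator rather than a real Lucene Collector:

```java
// Sketch: sort doc ids by field value, placing docs that have no value
// (according to the Bits-style set) deterministically first or last,
// instead of letting them sort by an arbitrary default value.
import java.util.Arrays;
import java.util.BitSet;

public class MissingAwareSort {
    public static Integer[] sort(int[] values, BitSet hasValue,
                                 boolean missingLast) {
        Integer[] docs = new Integer[values.length];
        for (int i = 0; i < docs.length; i++) docs[i] = i;
        Arrays.sort(docs, (a, b) -> {
            boolean ha = hasValue.get(a), hb = hasValue.get(b);
            if (ha != hb) {
                // exactly one of the two docs is missing a value
                return (ha ? -1 : 1) * (missingLast ? 1 : -1);
            }
            return Integer.compare(values[a], values[b]);  // both present
        });
        return docs;
    }
}
```

With the Bits available from the cache, the same comparator works for both orderings by flipping a single flag, which is the consistency the comment asks for.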
