[jira] [Commented] (LUCENE-5476) Facet sampling

2014-03-17 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13937580#comment-13937580
 ] 

Gilad Barkai commented on LUCENE-5476:
--

About the scores (the only part I got to review thus far): the scores should be 
a non-sparse float array.
E.g., if there are 1M documents and the original set contains 1000 documents, 
the score[] array would be of length 1000. If the sampled set only has 10 
documents, the score[] array should be of length 10.

The relevant part:
{code}
if (getKeepScores()) {
  scores[doc] = docs.scores[doc];
}
{code}
should be changed, as the scores[] size and index should be relative to the 
sampled set and not the original results.
Also, could the size of the score[] array be the number of bins?
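
A minimal sketch of what I mean (the names here are illustrative, not the 
patch's):
{code}
// Sketch: rebuild a non-sparse scores array relative to the sampled set.
// 'original' is the original MatchingDocs and 'sampledBits' the sampled
// FixedBitSet; original.scores is non-sparse, i.e. indexed by hit position.
// (IOException handling elided.)
float[] sampledScores = new float[numSampledDocs];
DocIdSetIterator it = original.bits.iterator();
int originalPos = 0, samplePos = 0;
for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
  if (sampledBits.get(doc)) {
    sampledScores[samplePos++] = original.scores[originalPos];
  }
  originalPos++;
}
{code}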

 Facet sampling
 --

 Key: LUCENE-5476
 URL: https://issues.apache.org/jira/browse/LUCENE-5476
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Rob Audenaerde
 Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 LUCENE-5476.patch, LUCENE-5476.patch, 
 SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java


 With LUCENE-5339 facet sampling disappeared. 
 When trying to display facet counts on large datasets (10M documents) 
 counting facets is rather expensive, as all the hits are collected and 
 processed. 
 Sampling greatly reduced this and thus provided a nice speedup. Could it be 
 brought back?






[jira] [Commented] (LUCENE-5476) Facet sampling

2014-03-09 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13925158#comment-13925158
 ] 

Gilad Barkai commented on LUCENE-5476:
--

{quote}
The limit should also take into account the total number of hits for the 
query; otherwise the estimate and the multiplication by the sampling factor 
may yield a larger number than the actual results.
{quote}

I understand this statement is confusing; I'll try to elaborate.
If the sample were *exactly* at the sampling ratio, this would not be a problem; 
but since the sample - being random as it is - may be a bit larger, adjusting 
according to the original sampling ratio (rather than the actual one) may yield 
larger counts than the actual results.
This could be solved by either limiting to the number of results, or adjusting 
the {{samplingRate}} to be the exact, post-sampling, ratio.
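
A sketch of the second option (using the exact post-sampling ratio; the 
variable names are illustrative):
{code}
// Sketch: correct a sampled count by the actual, post-sampling ratio rather
// than the configured samplingRate, and cap it by the real number of hits.
double actualRatio = (double) sampleSize / totalHits;
int estimate = (int) (sampledCount / actualRatio);
int corrected = Math.min(estimate, totalHits);
{code}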

 Facet sampling
 --

 Key: LUCENE-5476
 URL: https://issues.apache.org/jira/browse/LUCENE-5476
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Rob Audenaerde
 Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java


 With LUCENE-5339 facet sampling disappeared. 
 When trying to display facet counts on large datasets (10M documents) 
 counting facets is rather expensive, as all the hits are collected and 
 processed. 
 Sampling greatly reduced this and thus provided a nice speedup. Could it be 
 brought back?






[jira] [Commented] (LUCENE-5476) Facet sampling

2014-03-07 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13924348#comment-13924348
 ] 

Gilad Barkai commented on LUCENE-5476:
--

{quote}
Btw. is there an easy way to retrieve the total facet counts for an ordinal? 
When correcting facet counts it would be a quick win to limit the number of 
estimated documents to the actual number of documents in the index that match 
that facet. (And maybe use the distribution as well, to make better estimates.)
{quote}

That's a great idea!

The {{docFreq}} of the category drill-down term is an upper bound - and could 
be used as a limit.
It's cheap, but might not be the exact number, as it also counts 
deleted documents.
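
A sketch of using it as a cap (assuming the drill-down term can be built via 
{{DrillDownQuery.term()}}; the field/dim names are illustrative):
{code}
// Sketch: docFreq of the category's drill-down term is a cheap upper bound.
Term ddTerm = DrillDownQuery.term("$facets", "Author", "Gilad");
int upperBound = reader.docFreq(ddTerm); // may still count deleted docs
correctedCount = Math.min(correctedCount, upperBound);
{code}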

The limit should also take into account the total number of hits for the 
query; otherwise the estimate and the multiplication by the sampling factor 
may yield a larger number than the actual results.

 Facet sampling
 --

 Key: LUCENE-5476
 URL: https://issues.apache.org/jira/browse/LUCENE-5476
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Rob Audenaerde
 Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java


 With LUCENE-5339 facet sampling disappeared. 
 When trying to display facet counts on large datasets (10M documents) 
 counting facets is rather expensive, as all the hits are collected and 
 processed. 
 Sampling greatly reduced this and thus provided a nice speedup. Could it be 
 brought back?






[jira] [Commented] (LUCENE-5476) Facet sampling

2014-03-06 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922374#comment-13922374
 ] 

Gilad Barkai commented on LUCENE-5476:
--

Hi Rob, patch looks great.

A few comments:
* Some imports are not used (o.a.l.u.Bits, o.a.l.s.Collector & o.a.l.s.DocIdSet)
* Perhaps the parameters initialized in the RandomSamplingFacetsCollector c'tor 
could be made {{final}}
* XORShift64Random.XORShift64Random() (the default c'tor) is never used. Perhaps 
it was needed for usability when this was thought to be a core utility, and was 
left in by mistake? Should it be called somewhere?
* {{getMatchingDocs()}}
** when {{!sampleNeeded()}} there's a call to {{super.getMatchingDocs()}}; this 
may be a redundant method call, as it is already invoked 5 lines above, and the 
code always computes {{totalHits}} first. Perhaps the original matching docs 
could be stored as a member? This would also help some implementations of 
correcting the sampled facet results.
** {{totalHits}} is redundantly computed again in lines 147-152
* {{needsSampling()}} could perhaps be made protected, allowing other criteria 
for sampling to be added
* {{createSample()}}
** {{randomIndex}} is initialized to {{0}}, effectively making the first 
document of every segment's bin the representative of that bin, neglecting the 
rest of the bin (regardless of the seed). So if a bin is the size of 1000 
documents, then there are 999 documents that, regardless of the seed, would 
always be neglected. It may be better to initialize it as 
{{randomIndex = random.nextInt(binsize)}}, as is done for the 2nd bin onward; 
see the sketch below.
** While creating a new {{MatchingDocs}} with the sampled set, the original 
{{totalHits}} and original {{scores}} are used. I'm not 100% sure the first is 
an issue, but any facet accumulation which relies on document scores would be 
hit by the second, as the {{scores}} (at least by the javadocs) are defined as 
non-sparse.
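
A minimal sketch of the {{randomIndex}} fix (variable names follow the patch; 
the surrounding iteration is assumed):
{code}
int countInBin = 0;
int randomIndex = random.nextInt(binsize); // was: int randomIndex = 0;
for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
  if (countInBin == randomIndex) {
    sampledBits.set(doc); // this bin's randomly chosen representative
  }
  if (++countInBin == binsize) {
    countInBin = 0;
    randomIndex = random.nextInt(binsize); // re-draw for the next bin
  }
}
{code}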


 Facet sampling
 --

 Key: LUCENE-5476
 URL: https://issues.apache.org/jira/browse/LUCENE-5476
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Rob Audenaerde
 Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java


 With LUCENE-5339 facet sampling disappeared. 
 When trying to display facet counts on large datasets (10M documents) 
 counting facets is rather expensive, as all the hits are collected and 
 processed. 
 Sampling greatly reduced this and thus provided a nice speedup. Could it be 
 brought back?






[jira] [Commented] (LUCENE-5476) Facet sampling

2014-03-04 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13919380#comment-13919380
 ] 

Gilad Barkai commented on LUCENE-5476:
--

Sorry if this was mentioned before and I missed it - how can sampling work 
correctly (correctness of the end result) if it's done 'on the fly'?
Beforehand, one cannot know the number of documents that would match the query; 
as such, the sampling ratio is unknown, given that we can afford faceted search 
over N documents only.
If the query yields 10K results and the sampling ratio is 0.001 - would 10 
documents make a good sample?
Same if the query yields 100M results - is a 10K sample good enough? Is it 
perhaps too much?
I find it hard to come up with a pre-defined sampling ratio which would fit 
different cases.

 Facet sampling
 --

 Key: LUCENE-5476
 URL: https://issues.apache.org/jira/browse/LUCENE-5476
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Rob Audenaerde
 Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
 LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, 
 SamplingFacetsCollector.java


 With LUCENE-5339 facet sampling disappeared. 
 When trying to display facet counts on large datasets (10M documents) 
 counting facets is rather expensive, as all the hits are collected and 
 processed. 
 Sampling greatly reduced this and thus provided a nice speedup. Could it be 
 brought back?






[jira] [Updated] (LUCENE-5476) Facet sampling

2014-03-03 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5476:
-

Attachment: SamplingComparison_SamplingFacetsCollector.java

Nice patch.

I was surprised by the gain - only 3.5 times the gain for 1/1000 of the 
documents... I think this is because the random generation takes too long?

Perhaps it is too heavy to call {{.get(docId)}} on the original/clone and 
{{.clear()}} for every bit.

I crafted some code for creating a new (cleared) {{FixedBitSet}} instead of a 
clone, setting only the sampled bits. With it, instead of going over every 
document and calling {{.get(docId)}}, I used the iterator, which may be more 
efficient when it comes to skipping cleared bits (deleted docs?).

Ran the code on a scenario as you described (a bitset sized at 10M, with 2.5M 
of it randomly selected to be set, and then sampling a thousandth of it down 
to 2,500 documents).
The results:
* non-cloned + iterator: ~20ms
* cloned + for loop on each docId: ~50ms

{code}
// Sample one random document per bin, iterating the original DocIdSet
// instead of probing every docId with .get().
int countInBin = 0;
int randomIndex = random.nextInt(binsize);
final DocIdSetIterator it = docIdSet.iterator();

try {
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    if (++countInBin == binsize) {
      // the bin is full - reset the counter and draw a fresh index for the next bin
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    if (countInBin == randomIndex) {
      sampledBits.set(doc); // this bin's representative
    }
  }
} catch (IOException e) {
  // iteration is over an in-memory bitset; should not happen
  throw new RuntimeException(e);
}
{code}

Also attaching the java file with a main() which compares the two 
implementations (the original still named {{createSample()}} and the proposed 
one {{createSample2()}}).

Hopefully, if this proves consistent, the end-to-end sampling should be closer 
to the factor-10 gain hoped for.

 Facet sampling
 --

 Key: LUCENE-5476
 URL: https://issues.apache.org/jira/browse/LUCENE-5476
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Rob Audenaerde
 Attachments: LUCENE-5476.patch, LUCENE-5476.patch, 
 SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java


 With LUCENE-5339 facet sampling disappeared. 
 When trying to display facet counts on large datasets (10M documents) 
 counting facets is rather expensive, as all the hits are collected and 
 processed. 
 Sampling greatly reduced this and thus provided a nice speedup. Could it be 
 brought back?






[jira] [Commented] (LUCENE-5476) Facet sampling

2014-03-01 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13917158#comment-13917158
 ] 

Gilad Barkai commented on LUCENE-5476:
--

Great effort!

I wish to throw in another point - the description of this issue is about 
sampling, but the implementation is about *random* sampling.
This is not always the case, nor is it very fast (indeed, calling 
Random.nextInt 1M times would be measurable by itself, IMHO).
A different sampler could be (pseudo-code)
{code}
int acceptedModulu = (int) (1 / sampleRatio);

int next() {
  do {
    nextDoc = inner.next();
  } while (nextDoc != NO_MORE_DOCS && nextDoc % acceptedModulu != 0);

  return nextDoc;
}
{code}

This should be faster as a sampler, and perhaps saves us from creating a new 
{{DocIdSet}}.

One last thing - if I did the math right, the sample crafted by the code in 
the patch would be twice as large as the user may expect.
For a sample ratio of 0.1, random.nextInt() is called with 10, so the average 
jump is actually 5 - and every 5th document in the original set (on average) 
would be selected, rather than every 10th. I think random.nextInt should be 
called with twice the size it is called with now (e.g. 20, making the average 
random selection every 10th document).
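
A quick empirical check of that claim (a standalone sketch; nothing here is 
from the patch):
{code}
import java.util.Random;

public class AvgJump {
  public static void main(String[] args) {
    Random random = new Random(42);
    int bound = 10; // what the patch would pass for a 0.1 sample ratio
    long sum = 0;
    int trials = 1_000_000;
    for (int i = 0; i < trials; i++) {
      sum += random.nextInt(bound); // uniform over 0..9
    }
    // Prints ~4.5: the average jump, so roughly every 5th doc gets picked -
    // an effective sample ratio of ~0.2 instead of the requested 0.1.
    System.out.println((double) sum / trials);
  }
}
{code}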

 Facet sampling
 --

 Key: LUCENE-5476
 URL: https://issues.apache.org/jira/browse/LUCENE-5476
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Rob Audenaerde
 Attachments: LUCENE-5476.patch, SamplingFacetsCollector.java


 With LUCENE-5339 facet sampling disappeared. 
 When trying to display facet counts on large datasets (10M documents) 
 counting facets is rather expensive, as all the hits are collected and 
 processed. 
 Sampling greatly reduced this and thus provided a nice speedup. Could it be 
 brought back?






[jira] [Commented] (LUCENE-5457) Expose SloppyMath earth diameter table

2014-02-19 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905736#comment-13905736
 ] 

Gilad Barkai commented on LUCENE-5457:
--

+1
Sorry for the confusion; indeed it should be diameter, as the multiplication 
(*2) was moved into the pre-computed table, hence saving the operation at 
runtime, as per Ryan's comment.
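
For context, the lookup amounts to something like this sketch (the names and 
the per-degree granularity are my assumptions, not the actual SloppyMath code):
{code}
// Sketch: a pre-computed, latitude-indexed earth *diameter* table; folding
// the *2 into the table saves a multiplication per lookup.
// One entry per degree of |latitude|, filled from the WGS84 ellipsoid (omitted).
private static final double[] EARTH_DIAMETER = new double[91];

static double earthDiameter(double latitudeDegrees) {
  int index = (int) Math.abs(latitudeDegrees) % EARTH_DIAMETER.length; // guards out-of-range input
  return EARTH_DIAMETER[index];
}
{code}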

 Expose SloppyMath earth diameter table
 --

 Key: LUCENE-5457
 URL: https://issues.apache.org/jira/browse/LUCENE-5457
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Fix For: 4.7

 Attachments: LUCENE-5457.patch


 LUCENE-5271 introduced a table in order to get approximate values of the 
 diameter of the earth given a latitude. This could be useful for other 
 computations so I think it would be nice to have a method that exposes this 
 table.






[jira] [Updated] (LUCENE-5271) A slightly more accurate SloppyMath distance

2013-12-08 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5271:
-

Attachment: LUCENE-5271.patch

Ryan, thanks for looking at this.

bq. If the lat/lon values are large, then the index would be out of bounds for 
the table
Nice catch! I did not check for values over 90 degrees latitude. Added a % by 
the table's size.

bq. Why was this test removed? assertEquals(314.40338, haversin(1, 2, 3, 4), 
10e-5)
Well, the test's results were wrong :) The new, more accurate method returns 
different results. I added another test instead:
{code}
double earthRadiusKMs = 6378.137;             // WGS84 equatorial radius
double halfCircle = earthRadiusKMs * Math.PI; // half the equator's circumference
assertEquals(halfCircle, haversin(0, 0, 0, 180), 0D);
{code}
This computes half of Earth's circumference at the equator, using both the 
haversin and a simple circle equation with Earth's equatorial radius.
It differs by over 20 km from the old haversin result, btw.

bq. Could you move the 2 * radius computation into the table?
Awesome! Renamed the table to diameter rather than radius.

bq. I know this is an already existing problem, but could you move the division 
by 2 from h1/h2 to h?
Done.

 A slightly more accurate SloppyMath distance
 

 Key: LUCENE-5271
 URL: https://issues.apache.org/jira/browse/LUCENE-5271
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/other
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5271.patch, LUCENE-5271.patch, LUCENE-5271.patch


 SloppyMath, introduced in LUCENE-5258, uses earth's avg. (according to WGS84) 
 ellipsoid radius as an approximation for computing the spherical distance 
 (the TO_KILOMETERS constant).
 While this is pretty accurate for long distances (latitude-wise), it may 
 introduce some small errors while computing distances close to the equator 
 (as the earth radius there is larger than the avg.).
 A more accurate approximation would be taking the avg. earth radius at the 
 source and destination points. But computing an ellipsoid radius at any given 
 point is a heavy function, and this distance should be used in a scoring 
 function. So two optimizations are possible:
 * Pre-compute a table with an earth radius per latitude (the longitude does 
 not affect the radius)
 * Instead of using the two points' radius avg., figure out the avg. latitude 
 (exactly between the src and dst points) and get its radius.






[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

2013-11-27 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13833596#comment-13833596
 ] 

Gilad Barkai commented on LUCENE-5339:
--

Been away from the issue for some time, and it looks like major progress - 
hats off to him!

{{LabelAndValue}} & {{FacetResult}} use {{instanceof}} checks in their 
{{equals}} methods - is that a must?

{{FacetResult}} has a member called {{childCount}} - I think it's the number of 
categories/paths/labels that were encountered. The current jdocs ("How many 
labels were populated under the requested path") reveal implementation 
(population). Perhaps exchange "populated" with "encountered"?

{{FloatRange}} and {{DoubleRange}} use {{Math.nextUp/Down}} for infinity, as 
the ranges are always inclusive. Perhaps these constants for float and double 
could be static final.

{{TaxonomyFacetSumFloatAssociations}} and {{TaxonomyFacetSumValueSource}} reuse 
a LOT of code; can they extend one another? Perhaps extract a common super 
class for both?

In {{TaxonomyFacets}} the parents array is saved, but I could not see where 
it's being used (and I think it's not used even in the older taxonomy-facet 
implementation).

{{FacetConfig}} confuses me a bit: on the one hand it's very much aware of the 
Taxonomy, on the other it handles all the kinds of facets.
Perhaps {{FacetConfig.build()}} could be split up, allowing each 
{{FacetField.Type}} a build() method of its own, rather than every type's 
building being done in the same method. It would also bring a common parent 
class to all FacetField types, which I also like. As such, the taxonomy part, 
with {{processFacetFields()}}, could be moved to its respective Facet 
implementation.
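
A rough sketch of the suggested split (purely illustrative - these signatures 
are my suggestion, not the patch's API; {{Document}} and {{FacetConfig}} are 
the patch's types):
{code}
// Sketch: each facet field type builds its own indexed form, instead of
// FacetConfig.build() branching on every type.
abstract class FacetField {
  final String dim;
  final String[] path;

  FacetField(String dim, String... path) {
    this.dim = dim;
    this.path = path;
  }

  /** Appends this field's facet data to the document being built. */
  abstract void build(Document doc, FacetConfig config);
}

class TaxonomyFacetField extends FacetField {
  TaxonomyFacetField(String dim, String... path) {
    super(dim, path);
  }

  @Override
  void build(Document doc, FacetConfig config) {
    // taxonomy-specific logic, e.g. what processFacetFields() does today
  }
}
{code}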
  

 Simplify the facet module APIs
 --

 Key: LUCENE-5339
 URL: https://issues.apache.org/jira/browse/LUCENE-5339
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-5339.patch, LUCENE-5339.patch


 I'd like to explore simplifications to the facet module's APIs: I
 think the current APIs are complex, and the addition of a new feature
 (sparse faceting, LUCENE-5333) threatens to add even more classes
 (e.g., FacetRequestBuilder).  I think we can do better.
 So, I've been prototyping some drastic changes; this is very
 early/exploratory and I'm not sure where it'll wind up but I think the
 new approach shows promise.
 The big changes are:
   * Instead of *FacetRequest/Params/Result, you directly instantiate
 the classes that do facet counting (currently TaxonomyFacetCounts,
 RangeFacetCounts or SortedSetDVFacetCounts), passing in the
 SimpleFacetsCollector, and then you interact with those classes to
 pull labels + values (topN under a path, sparse, specific labels).
   * At index time, no more FacetIndexingParams/CategoryListParams;
 instead, you make a new SimpleFacetFields and pass it the field it
 should store facets + drill downs under.  If you want more than
 one CLI you create more than one instance of SimpleFacetFields.
   * I added a simple schema, where you state which dimensions are
 hierarchical or multi-valued.  From this we decide how to index
 the ordinals (no more OrdinalPolicy).
 Sparse faceting is just another method (getAllDims), on both taxonomy
 & ssdv facet classes.
 I haven't created a common base class / interface for all of the
 search-time facet classes, but I think this may be possible/clean, and
 perhaps useful for drill sideways.
 All the new classes are under oal.facet.simple.*.
 Lots of things that don't work yet: drill sideways, complements,
 associations, sampling, partitions, etc.  This is just a start ...






[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

2013-11-23 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13830884#comment-13830884
 ] 

Gilad Barkai commented on LUCENE-5316:
--

I think we don't have to wait for LUCENE-5339; we could handle some of these 
issues now.

* {{if (kids == null)}} and hashing - there's a solution to that: not holding 
it in a map. We could hold it in an {{int[][]}}, with NO_CHILDREN for the 
entries which have no children. So we'll have one array at the size of the 
taxonomy, and the sum of the others will be significantly smaller if we have 
flat dimensions. We'll lose some RAM compared to the Map, but it will speed 
things up because we'll simply return what's in {{children\[ord\]}} (see the 
sketch below).

* {{if (children == null)}} is done only because it's being allocated lazily. 
Does it make sense to keep it that way? We could compute the children upon 
taxonomy opening and be done with it. Today (trunk) a reopen is very costly, 
as it's O(taxonomy size), which affects NRT reopen (it's not guaranteed that 
for each reopen someone will actually need the new facet information); but if 
the computation of the children is done in O(new segments), then we're not 
wasting much.
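
A minimal sketch of the {{int[][]}} idea (names are illustrative):
{code}
// Sketch: children held in an int[][] instead of a Map; flat-dimension
// ordinals all share the single NO_CHILDREN instance.
static final int[] NO_CHILDREN = new int[0];

int[][] children; // one entry per taxonomy ordinal, built at taxonomy-open time

int[] getChildren(int ord) {
  return children[ord]; // a plain array lookup - no hashing, no null checks
}
{code}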

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, 
 LUCENE-5316.patch, LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement

2013-11-17 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5316:
-

Attachment: LUCENE-5316.patch

Changes:
* Fixed concurrency issues
* Moved to an {{int[] getChildren(int ord)}} API, which should perform at least 
as fast as the taxonomy-arrays way.

The code is not ready for commit:
* The children map does not support fast reopen (using an older map and only 
updating it with newly added ordinals)
* The children-map initialization still uses the taxonomy arrays, so they are 
not completely removed.

Should a benchmark show this change performs better, I'll make the extra 
changes to get it commit-ready.

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, 
 LUCENE-5316.patch, LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Updated] (LUCENE-5271) A slightly more accurate SloppyMath distance

2013-11-17 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5271:
-

Attachment: LUCENE-5271.patch

Adapted tests to the more accurate distance method.
Patch is ready.

 A slightly more accurate SloppyMath distance
 

 Key: LUCENE-5271
 URL: https://issues.apache.org/jira/browse/LUCENE-5271
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/other
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5271.patch, LUCENE-5271.patch


 SloppyMath, introduced in LUCENE-5258, uses earth's avg. (according to WGS84) 
 ellipsoid radius as an approximation for computing the spherical distance 
 (the TO_KILOMETERS constant).
 While this is pretty accurate for long distances (latitude-wise), it may 
 introduce some small errors while computing distances close to the equator 
 (as the earth radius there is larger than the avg.).
 A more accurate approximation would be taking the avg. earth radius at the 
 source and destination points. But computing an ellipsoid radius at any given 
 point is a heavy function, and this distance should be used in a scoring 
 function. So two optimizations are possible:
 * Pre-compute a table with an earth radius per latitude (the longitude does 
 not affect the radius)
 * Instead of using the two points' radius avg., figure out the avg. latitude 
 (exactly between the src and dst points) and get its radius.






[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

2013-11-13 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13821142#comment-13821142
 ] 

Gilad Barkai commented on LUCENE-5339:
--

Mike, 
the idea of simplifying the API sounds great, but is it really that complicated 
now?

Facet's {{Accumulator}} is similar to Lucene's {{Collector}}, the 
{{Aggregator}} is sort of a {{Scorer}}, and a {{FacetRequest}} is a sort of 
{{Query}}.
Actually, the model after which the facets were designed was Lucene's.
The optional {{IndexingParams}} came before the {{IndexWriterConfig}}, but 
these can be said to be similar as well.

More low-level objects such as the {{CategoryListParams}} are not a must, and 
the user may never know about them (and btw, they are similar to {{Codecs}}).

I reviewed the patch (mostly the taxonomy-related part) and I think that even 
without associations, counts-only is a bit narrow.
Especially with large counts (say many thousands), the count doesn't say much, 
because of the long-tail problem.
When there's a large result set, all the categories will get high hit counts. 
And just as scoring by counting the number of query terms each document matches 
doesn't always make much sense (and I think all scoring functions do things a 
lot smarter), using counts for facets may at times yield irrelevant results.

We found out that for large result sets, an aggregation of Lucene's score 
(rather than {{+1}}), or even score^2, yields better results for the user. 
Also, arbitrary expressions which are corpus-specific (with or without 
associations) change the facets' usability dramatically. That's partially why 
the code was built to allow different aggregation techniques, allowing 
associations, numeric values, etc. into each value for each category.
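
For illustration, score-based aggregation amounts to something like this (a 
sketch, not the module's actual API):
{code}
// Sketch: aggregate the matching document's score (or score^2) into each of
// its category ordinals, instead of incrementing a counter by +1.
for (int ord : documentOrdinals) {
  values[ord] += score * score; // score^2; plain 'score' or any
                                // corpus-specific expression also works
}
{code}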

As for the new API, it may be useful if there were a single interface - so all 
facets implementations could be switched easily, allowing users to experiment 
with the different implementations without writing a lot of code.

Bottom line: I'm all for simplifying the API, but the current cost seems too 
great, and I'm not sure the benefits are proportional :)



 Simplify the facet module APIs
 --

 Key: LUCENE-5339
 URL: https://issues.apache.org/jira/browse/LUCENE-5339
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-5339.patch


 I'd like to explore simplifications to the facet module's APIs: I
 think the current APIs are complex, and the addition of a new feature
 (sparse faceting, LUCENE-5333) threatens to add even more classes
 (e.g., FacetRequestBuilder).  I think we can do better.
 So, I've been prototyping some drastic changes; this is very
 early/exploratory and I'm not sure where it'll wind up but I think the
 new approach shows promise.
 The big changes are:
   * Instead of *FacetRequest/Params/Result, you directly instantiate
 the classes that do facet counting (currently TaxonomyFacetCounts,
 RangeFacetCounts or SortedSetDVFacetCounts), passing in the
 SimpleFacetsCollector, and then you interact with those classes to
 pull labels + values (topN under a path, sparse, specific labels).
   * At index time, no more FacetIndexingParams/CategoryListParams;
 instead, you make a new SimpleFacetFields and pass it the field it
 should store facets + drill downs under.  If you want more than
 one CLI you create more than one instance of SimpleFacetFields.
   * I added a simple schema, where you state which dimensions are
 hierarchical or multi-valued.  From this we decide how to index
 the ordinals (no more OrdinalPolicy).
 Sparse faceting is just another method (getAllDims), on both taxonomy
 & ssdv facet classes.
 I haven't created a common base class / interface for all of the
 search-time facet classes, but I think this may be possible/clean, and
 perhaps useful for drill sideways.
 All the new classes are under oal.facet.simple.*.
 Lots of things that don't work yet: drill sideways, complements,
 associations, sampling, partitions, etc.  This is just a start ...






[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement

2013-11-13 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5316:
-

Attachment: LUCENE-5316.patch

Attaching a fixed patch with the concurrency issues resolved.
This is not yet the patch with the {{int[]}} instead of an iterator.

I'm not sure that following the path of the int[] is right; it would block all 
future extensions, e.g. using on-disk children.
What do you think? Perhaps it is better to keep the iterator?

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch, 
 LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

2013-11-07 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13816183#comment-13816183
 ] 

Gilad Barkai commented on LUCENE-5316:
--

That's... not an optimistic result :)
I think there should not be any MT issues? The initialization occurs while the 
taxonomy is being opened, and later on it's just read operations...
Perhaps this API change should be modified, and we'd try Shai's approach - 
instead of a {{ChildrenIterator}} with a .next() method, return an {{int[]}}.
The Map is still applicable for this approach, just that the int[] will not be 
wrapped with an iterator object.
I'll also review the patch and try to figure out what MT issues I missed.

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement

2013-11-05 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5316:
-

Attachment: LUCENE-5316.patch

Introducing a map-based children iterator, which holds for every ordinal (real 
parent) an {{int[]}} containing its direct children.

Each such {{int[]}} has an extra last slot holding 
{{TaxonomyReader.INVALID_ORDINAL}}, which spares an {{if}} check on every 
{{ChildrenIterator.next()}} call (see the sketch below).

This is a quick and dirty patch, just so we can verify that the penalty for 
moving to the API/map is not great. If it is, the whole issue should be 
reconsidered, and perhaps a move to a direct {{int[]}} for iterating over 
children should be implemented.
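
A sketch of how the sentinel slot removes the per-call check (illustrative, 
not the patch code):
{code}
// Sketch: children array with a trailing INVALID_ORDINAL sentinel - next()
// needs no bounds/hasNext check; callers simply stop at INVALID_ORDINAL.
class ChildrenIterator {
  private final int[] children; // last slot holds TaxonomyReader.INVALID_ORDINAL
  private int pos = 0;

  ChildrenIterator(int[] children) {
    this.children = children;
  }

  int next() {
    return children[pos++]; // eventually returns the sentinel
  }
}
{code}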

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch, LUCENE-5316.patch, LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement

2013-11-03 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5316:
-

Attachment: LUCENE-5316.patch

Updated version; the iterator is now returned as {{null}} if the ordinal has no 
children.

I think there are further improvements (a few {{if}}s and a loop check could be 
avoided), but at least the .next() call is now avoided, as per Mike's 
suggestion. Hope this speeds things up a bit.

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch, LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

2013-11-03 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812297#comment-13812297
 ] 

Gilad Barkai commented on LUCENE-5316:
--

The patch is not ready for commit - I just got a stack overflow in my head 
while chasing these NPEs in the loop-as-recursion thing...
Shai, you are right in the sense that we could avoid an extra loop/recursion if 
the kids are null, but this now works, doesn't throw NPEs in weird places, and 
allows returning {{null}} while not calling the extra {{CI.next()}}.
I'm still looking into that.
I touched the recurring loop as little as possible, as I'm not 100% sure I 
understand it fully (also, changing a small, innocent thing breaks things 
badly). In the old code there were hidden assumptions which are now no longer 
true; I traced some, but I'm not sure about others.

bq. how about if we consolidate getPTA() and getChildren() into a single 
getTaxonomyTree()
As for the API change - getPTA is being quietly - though with determination - 
phased out. getChildren is all that is left.
I'm not sure .getChildren() should be encapsulated in a TaxonomyTree object 
if it's the only API.
Do you think other API should be put there as well?

bq. I think that we should experiment here with a TaxoTree object which holds 
the children in a map
The current issue is not yet done; there are still tests to consider, and 
hiding PTA even further from the users' eyes.
I feel that this issue is big enough to stand on its own, while the children 
map is also large enough? It would contain code for the iterations, but also 
some code in the taxo-reopen phase that will (hopefully) replace the PTA 
reopen code.


 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch, LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

2013-10-31 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13810087#comment-13810087
 ] 

Gilad Barkai commented on LUCENE-5316:
--

I like the {{null}} for ordinals with no children. Made the change, and now I'm 
chasing all the NPEs that it caused; hope to get a new patch up and about soon.

As for allowing the taxonomy to say the depth of each dimension - that's 
trickier. Obviously, that's a temporary state, as every flat dimension can 
become non-flat. Also, figuring this out during search (more precisely, once 
per taxo-reader opening) is O(taxo size) in the current implementation.

Perhaps, if we're willing to invest a little more time during indexing, we 
could roll up and tell the parents (say, in an incremental numeric field 
update) how deep their children go?
In such a case, we could benefit not only from a flat dimension, but whenever 
an ordinal has no grandchildren. Investing during indexing would make it an 
O(1) operation.

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold, for each ordinal, its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses locality of reference while traversing (the array is accessed at 
 increasing-only entries, but they may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(taxonomy size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it up for future improvements (i.e. memory footprint and NRT 
 cost) - without changing any of the internals. 
 A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Created] (LUCENE-5316) Taxonomy tree traversing improvement

2013-10-30 Thread Gilad Barkai (JIRA)
Gilad Barkai created LUCENE-5316:


 Summary: Taxonomy tree traversing improvement
 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor


The taxonomy traversing is done today utilizing the {{ParallelTaxonomyArrays}}. 
In particular, two taxonomy-size {{int}} arrays which hold, for each ordinal, 
its (array #1) youngest child and (array #2) older sibling.

This is a compact way of holding the tree information in memory, but it's not 
perfect:
* Large (8 bytes per ordinal in memory)
* Exposes internal implementation
* Utilizing these arrays for tree traversing is not straightforward
* Loses locality of reference while traversing (the array is accessed at 
increasing-only entries, but they may be distant from one another)
* In NRT, a reopen is always (not just worst case) done in O(taxonomy size)

This issue is about making the traversing easier, the code more readable, 
and opening it up for future improvements (i.e. memory footprint and NRT 
cost) - without changing any of the internals.
A later issue(s?) could be opened to address the gaps once this one is done.






[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

2013-10-30 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808968#comment-13808968
 ] 

Gilad Barkai commented on LUCENE-5316:
--

Thanks Mike, you're right on the money.

The RAM consumption is indeed an issue here.
I'm not sure the parent array is used during search at all... and perhaps it 
could be removed (looking into that one as well).
The youngestChild/olderSibling arrays should, and could, be replaced with either 
a more compact RAM representation or, in extreme cases, even an on-disk one.

For a better RAM representation, the idea of a map from ord to an int[] of its 
children is a start.
In such a case, we benefit from not holding a 'youngestChild' int for each 
ordinal of a flat dimension - as they have no children.
Second, we benefit from locality of reference, as all the children sit near one 
another rather than being spread over an array of millions. The non-huge flat 
dimensions would no longer suffer because of the other dimensions.
Also, it would make the worst case of an NRT reopen the same as the current cost 
(O(Taxo-size)), but it might be very small if only a few ordinals were added, as 
only their 'family' would be reallocated and managed, rather than the entire 
array.

At a further phase, compression could be applied to that int[] of children - 
we know the children are in ascending order, so we could encode only the d-gaps, 
figure out the largest d-gap, and use packed ints instead of a raw int[]. It 
would add some (I hope minor) CPU cost to the loop, but would benefit greatly 
when it comes to RAM consumption; see the sketch below.
I hope that all the logic can be encapsulated in the {{ChildrenIterator}}, and 
the user will benefit from a clean API and better RAM utilization.
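
For illustration, a rough sketch of that d-gap + packed-ints storage 
({{PackedInts}} is Lucene's existing oal.util.packed API; the helper itself is 
hypothetical):

{code}
// Hypothetical helper: encode a sorted children list as d-gaps in packed ints.
// Children ordinals ascend, so every gap is positive and usually small.
static PackedInts.Reader packChildren(int[] children) {
  int maxGap = 0, prev = 0;
  for (int child : children) {
    maxGap = Math.max(maxGap, child - prev);
    prev = child;
  }
  PackedInts.Mutable packed = PackedInts.getMutable(
      children.length, PackedInts.bitsRequired(maxGap), PackedInts.DEFAULT);
  prev = 0;
  for (int i = 0; i < children.length; i++) {
    packed.set(i, children[i] - prev); // store the gap, not the absolute ordinal
    prev = children[i];
  }
  return packed;
}
// A ChildrenIterator over this storage decodes with: ord = prev + (int) packed.get(i).
{code}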

I'll post a patch shortly, which covers the very first part - hiding the 
implementation detail of children arrays (making 
TaxoReader.getParallelTaxoArrays protected to begin with), and moving 
{{TopKFacetResultHandler}} to use {{ChildrenIterator}}. 

Currently debugging some nasty loop-as-a-recursion related bug :)

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor

 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold for each ordinal its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses reference locality while traversing (the array is accessed in 
 increasing-only order, but consecutive entries may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(Taxonomy-size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it to future improvements (i.e. memory footprint and NRT cost) - 
 without changing any of the internals. 
 A later issue(s?) could be opened to address the remaining gaps once this one is done.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5316) Taxonomy tree traversing improvement

2013-10-30 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5316:
-

Attachment: LUCENE-5316.patch

{{TaxonomyReader.getParallelTaxonomyArrays}} is now protected; the 
implementation is only on {{DirectoryTaxonomyReader}}, in which it is protected 
as well.

Parallel arrays are only used in tests ATM - all tree traversing is done using 
{{TaxonomyReader.getChildren(int ordinal)}}, which is now abstract and 
implemented in DirTaxoReader.

Mike, if you could please run this patch against the benchmarking machine it 
would be awesome - as the direct array access is now replaced with a method 
call (the iterator's {{.next()}}).

I hope we will not see any significant degradation.
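
For reference, traversal with the new API would look roughly like this (a 
sketch based on the patch description; the contract assumed here is {{next()}} 
returning child ordinals until {{TaxonomyReader.INVALID_ORDINAL}}):

{code}
// Sketch: recursive depth-first traversal via the proposed getChildren(),
// with no direct access to the parallel child/sibling arrays.
void visitSubtree(TaxonomyReader taxoReader, int ordinal) throws IOException {
  ChildrenIterator children = taxoReader.getChildren(ordinal);
  for (int child = children.next();
       child != TaxonomyReader.INVALID_ORDINAL;
       child = children.next()) {
    visitSubtree(taxoReader, child); // recurse into each child
  }
}
{code}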

 Taxonomy tree traversing improvement
 

 Key: LUCENE-5316
 URL: https://issues.apache.org/jira/browse/LUCENE-5316
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5316.patch


 The taxonomy traversing is done today utilizing the 
 {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
 which hold for each ordinal its (array #1) youngest child and (array #2) 
 older sibling.
 This is a compact way of holding the tree information in memory, but it's not 
 perfect:
 * Large (8 bytes per ordinal in memory)
 * Exposes internal implementation
 * Utilizing these arrays for tree traversing is not straightforward
 * Loses reference locality while traversing (the array is accessed in 
 increasing-only order, but consecutive entries may be distant from one another)
 * In NRT, a reopen is always (not just worst case) done in O(Taxonomy-size)
 This issue is about making the traversing easier, the code more readable, 
 and opening it to future improvements (i.e. memory footprint and NRT cost) - 
 without changing any of the internals. 
 A later issue(s?) could be opened to address the remaining gaps once this one is done.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5271) A slightly more accurate SloppyMath distance

2013-10-09 Thread Gilad Barkai (JIRA)
Gilad Barkai created LUCENE-5271:


 Summary: A slightly more accurate SloppyMath distance
 Key: LUCENE-5271
 URL: https://issues.apache.org/jira/browse/LUCENE-5271
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/other
Reporter: Gilad Barkai
Priority: Minor


SloppyMath, introduced in LUCENE-5258, uses earth's avg. (according to WGS84) 
ellipsoid radius as an approximation for computing the spherical distance 
(the TO_KILOMETERS constant).

While this is pretty accurate for long distances (latitude wise), it may 
introduce some small errors when computing distances close to the equator (as 
the earth radius there is larger than the avg.)

A more accurate approximation would be taking the avg. earth radius at the 
source and destination points. But computing an ellipsoid radius at any given 
point is a heavy function, and this distance should be used in a scoring 
function. So two optimizations are possible (see the sketch after this list):
* Pre-compute a table with an earth radius per latitude (the longitude does not 
affect the radius)
* Instead of using a two-point radius avg., figure out the avg. latitude (exactly 
between the src and dst points) and get its radius
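
For illustration, the per-latitude (geocentric, WGS84) radius such a table 
would hold can be computed as follows; a sketch only, not the attached patch:

{code}
// Geocentric radius (km) of the WGS84 ellipsoid at a geodetic latitude (radians).
// A = semi-major axis, B = semi-minor axis, both in kilometers (WGS84).
static final double A = 6378.137;
static final double B = 6356.7523142;

static double earthRadiusKm(double latRadians) {
  double cos = Math.cos(latRadians), sin = Math.sin(latRadians);
  double an = A * A * cos, bn = B * B * sin;
  double ad = A * cos,     bd = B * sin;
  return Math.sqrt((an * an + bn * bn) / (ad * ad + bd * bd));
}

// The second option above: a single lookup at the midpoint latitude of src/dst.
static double radiusForPair(double lat1Radians, double lat2Radians) {
  return earthRadiusKm((lat1Radians + lat2Radians) / 2);
}
{code}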




--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5271) A slightly more accurate SloppyMath distance

2013-10-09 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5271:
-

Attachment: LUCENE-5271.patch

A proposed solution, as described above.
Keep in mind that this is _not_ ready for commit, as it breaks one of the tests 
derived for ES/Solr (as a result of the improved accuracy).

 A slightly more accurate SloppyMath distance
 

 Key: LUCENE-5271
 URL: https://issues.apache.org/jira/browse/LUCENE-5271
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/other
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5271.patch


 SloppyMath, introduced in LUCENE-5258, uses earth's avg. (according to WGS84) 
 ellipsoid radius as an approximation for computing the spherical distance 
 (the TO_KILOMETERS constant).
 While this is pretty accurate for long distances (latitude wise), it may 
 introduce some small errors when computing distances close to the equator 
 (as the earth radius there is larger than the avg.)
 A more accurate approximation would be taking the avg. earth radius at the 
 source and destination points. But computing an ellipsoid radius at any given 
 point is a heavy function, and this distance should be used in a scoring 
 function. So two optimizations are possible:
 * Pre-compute a table with an earth radius per latitude (the longitude does 
 not affect the radius)
 * Instead of using a two-point radius avg., figure out the avg. latitude 
 (exactly between the src and dst points) and get its radius



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-5271) A slightly more accurate SloppyMath distance

2013-10-09 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13791119#comment-13791119
 ] 

Gilad Barkai edited comment on LUCENE-5271 at 10/10/13 2:30 AM:


A proposed solution, as described above.
Please note it is _not_ ready for commit, as it breaks one of the tests derived 
for ES/Solr (as a result of the improved accuracy).


was (Author: gilad):
A proposed solution, as described above.
Keep in mind that this is _not_ ready for commit, as it breaks one of the tests 
derived for ES/Solr (as a result of the improved accuracy).

 A slightly more accurate SloppyMath distance
 

 Key: LUCENE-5271
 URL: https://issues.apache.org/jira/browse/LUCENE-5271
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/other
Reporter: Gilad Barkai
Priority: Minor
 Attachments: LUCENE-5271.patch


 SloppyMath, introduced in LUCENE-5258, uses earth's avg. (according to WGS84) 
 ellipsoid radius as an approximation for computing the spherical distance 
 (the TO_KILOMETERS constant).
 While this is pretty accurate for long distances (latitude wise), it may 
 introduce some small errors when computing distances close to the equator 
 (as the earth radius there is larger than the avg.)
 A more accurate approximation would be taking the avg. earth radius at the 
 source and destination points. But computing an ellipsoid radius at any given 
 point is a heavy function, and this distance should be used in a scoring 
 function. So two optimizations are possible:
 * Pre-compute a table with an earth radius per latitude (the longitude does 
 not affect the radius)
 * Instead of using a two-point radius avg., figure out the avg. latitude 
 (exactly between the src and dst points) and get its radius



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5155) Add OrdinalValueResolver in favor of FacetRequest.getValueOf

2013-08-01 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13726262#comment-13726262
 ] 

Gilad Barkai commented on LUCENE-5155:
--

Patch looks good.
+1 for commit.

Perhaps also document that FRNode is now comparable?


 Add OrdinalValueResolver in favor of FacetRequest.getValueOf
 

 Key: LUCENE-5155
 URL: https://issues.apache.org/jira/browse/LUCENE-5155
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-5155.patch


 FacetRequest.getValueOf is responsible for resolving an ordinal's value. It 
 is given FacetArrays, and typically does something like 
 {{arrays.getIntArray()[ord]}} -- for every ordinal! The purpose of this 
 method is to allow special requests, e.g. average, to do some post processing 
 on the values, that couldn't be done during aggregation.
 I feel that getValueOf is in the wrong place -- the calls to 
 getInt/FloatArray are really redundant. Also, if an aggregator maintains some 
 statistics by which it needs to correct the aggregated values, it's not 
 trivial to pass it from the aggregator to the request.
 Therefore I would like to make the following changes:
 * Remove FacetRequest.getValueOf and .getFacetArraysSource
 * Add FacetsAggregator.createOrdinalValueResolver which takes the FacetArrays 
 and has a simple API .valueOf(ordinal).
 * Modify the FacetResultHandlers to use OrdValResolver.
 This allows an OVR to initialize the right array instance(s) in the ctor, and 
 return the value of the requested ordinal, without doing arrays.getArray() 
 calls.
 Will post a patch shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5016) Sampling can break FacetResult labeling

2013-05-30 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13670200#comment-13670200
 ] 

Gilad Barkai commented on LUCENE-5016:
--

Patch looks good.
+1 for commit 

 Sampling can break FacetResult labeling 
 

 Key: LUCENE-5016
 URL: https://issues.apache.org/jira/browse/LUCENE-5016
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.3
Reporter: Rob Audenaerde
Assignee: Shai Erera
Priority: Minor
 Attachments: LUCENE-5016.patch, test-labels.zip


 When sampling FacetResults, the TopKInEachNodeHandler is used to get the 
 FacetResults.
 This is my case:
 A FacetResult is returned (which matches a FacetRequest) from the 
 StandardFacetAccumulator. The facet has 0 results. The labelling of the 
 root-node seems incorrect. I know, from the StandardFacetAccumulator, that 
 the rootnode has a label, so I can use that one.
 Currently the recursivelyLabel method uses the taxonomyReader.getPath() to 
 retrieve the label. I think we can skip that for the rootNode when there are 
 no children (and gain a little performance on the way too?)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator

2013-05-26 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5015:
-

Attachment: LUCENE-5015.patch

True, looking at overSampleFactor is enough, but it's not obvious that 
TakmiFixer should be used with overSampleFactor > 1 to better the chances of 
the resulting top-k being accurate.
I'll add some documentation w.r.t this issue; I hope it will do.

The new patch defaults to {{NoopSampleFixer}}, which does not touch the results 
at all - if only a top-k is needed and the counts themselves do not matter, this 
is the least expensive option. 
Also, if instead of counts a percentage should be displayed (as in how much of 
the results match this category), the sampled value out of the sample size would 
yield the same result as the amortized fixed value out of the actual result-set 
size (e.g. 30 out of a 1,000-document sample is 3%, exactly as 30,000 out of 
1,000,000 results). That might render the amortized fixer moot..

The new patch accounts for {{SampleFixer}} being set in {{SamplingParams}}.

 Unexpected performance difference between SamplingAccumulator and 
 StandardFacetAccumulator
 --

 Key: LUCENE-5015
 URL: https://issues.apache.org/jira/browse/LUCENE-5015
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.3
Reporter: Rob Audenaerde
Assignee: Shai Erera
Priority: Minor
 Attachments: LUCENE-5015.patch, LUCENE-5015.patch, LUCENE-5015.patch, 
 LUCENE-5015.patch


 I have an unexpected performance difference between the SamplingAccumulator 
 and the StandardFacetAccumulator. 
 The case is an index with about 5M documents and each document containing 
 about 10 fields. I created a facet on each of those fields. When searching to 
 retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is 
 about twice as fast as the StandardFacetAccumulator. This is expected and a 
 nice speed-up. 
 However, when I use more CountFacetRequests to retrieve facet-counts for more 
 than one field, the speed of the SamplingAccumulator decreases, to the point 
 where the StandardFacetAccumulator is faster. 
 {noformat} 
 FacetRequests   Sampling   Standard
  1   391 ms 1100 ms
  2   531 ms 1095 ms 
  3   948 ms 1108 ms
  4  1400 ms 1110 ms
  5  1901 ms 1102 ms
 {noformat} 
 Is this behaviour normal? I did not expect it, as the SamplingAccumulator 
 needs to do less work? 
 Some code to show what I do:
 {code}
   searcher.search( facetsQuery, facetsCollector );
   final List<FacetResult> collectedFacets = facetsCollector.getFacetResults();
 {code}
 {code}
 final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );
 FacetsCollector facetsCollector;
 if ( isSampled )
 {
   facetsCollector = FacetsCollector.create( new SamplingAccumulator(
       new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 else
 {
   facetsCollector = FacetsCollector.create( FacetsAccumulator.create(
       facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 {code}
   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator

2013-05-26 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5015:
-

Attachment: LUCENE-5015.patch

Shai, I think you're right, a null {{SampleFixer}} makes more sense. 

While working on a test which validates that a flow works with the {{null}} 
fixer, I found out that it did not. The reason is Complements: by default, 
complement counting kicks in when enough results are found. I think this may 
hold the key to the performance differences as well.

Rob, could you please try the following code and report the results?

{code}
SamplingAccumulator accumulator = new SamplingAccumulator(
    new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo);

// Make sure no complements are in action
accumulator.setComplementThreshold(StandardFacetsAccumulator.DISABLE_COMPLEMENT);

facetsCollector = FacetsCollector.create(accumulator);
{code}

In the meantime, I made the changes to the patch and added the test for the 
{{null}} fixer.

 Unexpected performance difference between SamplingAccumulator and 
 StandardFacetAccumulator
 --

 Key: LUCENE-5015
 URL: https://issues.apache.org/jira/browse/LUCENE-5015
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.3
Reporter: Rob Audenaerde
Assignee: Shai Erera
Priority: Minor
 Attachments: LUCENE-5015.patch, LUCENE-5015.patch, LUCENE-5015.patch, 
 LUCENE-5015.patch, LUCENE-5015.patch


 I have an unexpected performance difference between the SamplingAccumulator 
 and the StandardFacetAccumulator. 
 The case is an index with about 5M documents and each document containing 
 about 10 fields. I created a facet on each of those fields. When searching to 
 retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is 
 about twice as fast as the StandardFacetAccumulator. This is expected and a 
 nice speed-up. 
 However, when I use more CountFacetRequests to retrieve facet-counts for more 
 than one field, the speed of the SamplingAccumulator decreases, to the point 
 where the StandardFacetAccumulator is faster. 
 {noformat} 
 FacetRequests   Sampling   Standard
  1   391 ms 1100 ms
  2   531 ms 1095 ms 
  3   948 ms 1108 ms
  4  1400 ms 1110 ms
  5  1901 ms 1102 ms
 {noformat} 
 Is this behaviour normal? I did not expect it, as the SamplingAccumulator 
 needs to do less work? 
 Some code to show what I do:
 {code}
   searcher.search( facetsQuery, facetsCollector );
   final List<FacetResult> collectedFacets = facetsCollector.getFacetResults();
 {code}
 {code}
 final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );
 FacetsCollector facetsCollector;
 if ( isSampled )
 {
   facetsCollector = FacetsCollector.create( new SamplingAccumulator(
       new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 else
 {
   facetsCollector = FacetsCollector.create( FacetsAccumulator.create(
       facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 {code}
   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator

2013-05-23 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13664976#comment-13664976
 ] 

Gilad Barkai commented on LUCENE-5015:
--

Hello Rob,

Indeed, that looks unexpected.
The immediate suspect is the fixing part of the sampling: after the sampled 
top-cK are computed for each facet request, each of the top-K candidates gets a 
real count computation, rather than a count over the sampled set of results.

How many results are in the result set? All the documents?

 Unexpected performance difference between SamplingAccumulator and 
 StandardFacetAccumulator
 --

 Key: LUCENE-5015
 URL: https://issues.apache.org/jira/browse/LUCENE-5015
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.3
Reporter: Rob Audenaerde
Priority: Minor

 I have an unexpected performance difference between the SamplingAccumulator 
 and the StandardFacetAccumulator. 
 The case is an index with about 5M documents and each document containing 
 about 10 fields. I created a facet on each of those fields. When searching to 
 retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is 
 about twice as fast as the StandardFacetAccumulator. This is expected and a 
 nice speed-up. 
 However, when I use more CountFacetRequests to retrieve facet-counts for more 
 than one field, the speed of the SamplingAccumulator decreases, to the point 
 where the StandardFacetAccumulator is faster. 
 {noformat} 
 FacetRequests   Sampling   Standard
  1   391 ms 1100 ms
  2   531 ms 1095 ms 
  3   948 ms 1108 ms
  4  1400 ms 1110 ms
  5  1901 ms 1102 ms
 {noformat} 
 Is this behaviour normal? I did not expect it, as the SamplingAccumulator 
 needs to do less work? 
 Some code to show what I do:
 {code}
   searcher.search( facetsQuery, facetsCollector );
   final List<FacetResult> collectedFacets = facetsCollector.getFacetResults();
 {code}
 {code}
 final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );
 FacetsCollector facetsCollector;
 if ( isSampled )
 {
   facetsCollector = FacetsCollector.create( new SamplingAccumulator(
       new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 else
 {
   facetsCollector = FacetsCollector.create( FacetsAccumulator.create(
       facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 {code}
   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator

2013-05-23 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13665018#comment-13665018
 ] 

Gilad Barkai commented on LUCENE-5015:
--

Sampling, with its defaults, takes its toll. 

In its defaults, sampling aims to produce the exact top-K results for each 
request, as if a {{StandardFacetAccumulator}} had been used - meaning it 
aims at producing the same top-K with the same counts.

The process begins with sampling the result set and computing the top-*cK* 
candidates for each of the *M* facet requests, producing amortized results. 
That part is faster than {{StandardFacetAccumulator}} because fewer documents' 
facets information gets processed.

The next part is the fixing, using a {{SampleFixer}} retrieved from a 
{{Sampler}}, in which fixed counts are produced which correlate better with 
the original document result set, rather than the sampled one. The default (and 
currently only) implementation of such a fixer is {{TakmiSampleFixer}}, which 
produces _exact_ counts for each of the *cK* candidates of each of the *M* 
facet requests. The counts are not computed against the facet information of 
each document, but rather by matching the skiplist of the drill-down term of 
each such candidate category against the bitset of the (actual) document 
results. The number of matches is the count. 
This is equivalent to a total-hit collector over a drill-down query for the 
candidate category on top of the original query.
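
Conceptually, that fixing step does something like the following per candidate 
category (a sketch using Lucene's generic iterator/bitset types, not the actual 
{{TakmiSampleFixer}} code):

{code}
// Exact count for one candidate: how many docs of the original result set also
// contain the category's drill-down term (skiplist vs. bitset matching).
int exactCount(DocIdSetIterator drillDownDocs, FixedBitSet originalResults)
    throws IOException {
  int count = 0;
  for (int doc = drillDownDocs.nextDoc();
       doc != DocIdSetIterator.NO_MORE_DOCS;
       doc = drillDownDocs.nextDoc()) {
    if (originalResults.get(doc)) {
      count++; // present in both => contributes to the fixed count
    }
  }
  return count;
}
{code}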
There's a tipping point at which not sampling is faster than sampling and then 
fixing using *c* x *K* x *M* skiplist matches against the bitset representing 
the document results. *c* defaults to 2 (see overSampleFactor in SamplingParams).

Over-sampling (a.k.a. *c*) is important for exact counts: it is conceivable 
that the accuracy of a sampled top-K is not 100%, but according to some 
measures we once ran, it is very likely that the true top-K results are within 
the sampled *2K* results. Fixing those 2K with their actual counts and 
re-sorting them accordingly yields a much more accurate top-K. 


E.g. requesting 5 count requests for top-10 with an overSampleFactor of 2 
results in 5 x 10 x 2 = 100 skiplist matches against the document-results bitset.


If amortized results suffice, a different {{SampleFixer}} could be coded which, 
e.g., amortizes the true count from the sampling ratio: if category C got a 
count of 3, and the sample was 1,000 results out of 1,000,000, then such an 
AmortizedSampleFixer would fix the count of C to be 3,000.
Such fixing is very fast, and the overSampleFactor should be set to 1.0.
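
The arithmetic of that amortization is trivial; a minimal sketch (hypothetical 
method, not an existing fixer):

{code}
// Amortized fix: scale a sampled count by the inverse sampling ratio.
// E.g. sampledCount=3, sampleSize=1,000, totalResults=1,000,000 => 3,000.
int amortizedCount(int sampledCount, int sampleSize, int totalResults) {
  double samplingRatio = (double) sampleSize / totalResults;
  return (int) Math.round(sampledCount / samplingRatio);
}
{code}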

Edit:
I now see that it is not that easy to code a different SampleFixer, nor to get 
it the information needed for the amortized result fixing suggested above. 
I'll try to open up the API some and make it more convenient.

 Unexpected performance difference between SamplingAccumulator and 
 StandardFacetAccumulator
 --

 Key: LUCENE-5015
 URL: https://issues.apache.org/jira/browse/LUCENE-5015
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.3
Reporter: Rob Audenaerde
Priority: Minor

 I have an unexpected performance difference between the SamplingAccumulator 
 and the StandardFacetAccumulator. 
 The case is an index with about 5M documents and each document containing 
 about 10 fields. I created a facet on each of those fields. When searching to 
 retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is 
 about twice as fast as the StandardFacetAccumulator. This is expected and a 
 nice speed-up. 
 However, when I use more CountFacetRequests to retrieve facet-counts for more 
 than one field, the speed of the SamplingAccumulator decreases, to the point 
 where the StandardFacetAccumulator is faster. 
 {noformat} 
 FacetRequests   Sampling   Standard
  1   391 ms 1100 ms
  2   531 ms 1095 ms 
  3   948 ms 1108 ms
  4  1400 ms 1110 ms
  5  1901 ms 1102 ms
 {noformat} 
 Is this behaviour normal? I did not expect it, as the SamplingAccumulator 
 needs to do less work? 
 Some code to show what I do:
 {code}
   searcher.search( facetsQuery, facetsCollector );
   final List<FacetResult> collectedFacets = 
 facetsCollector.getFacetResults();
 {code}
 {code}
 final FacetSearchParams facetSearchParams = new FacetSearchParams( 
 facetRequests );
 FacetsCollector facetsCollector;
 if ( isSampled )
 {
   facetsCollector =
   FacetsCollector.create( new SamplingAccumulator( new 
 RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 else
 {
   facetsCollector = FacetsCollector.create( FacetsAccumulator.create( 
 

[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator

2013-05-23 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5015:
-

Attachment: LUCENE-5015.patch

Added a parameter to {{SamplingParams}} named {{fixToExact}}, which defaults to 
{{false}}. 
I think it is probable that one who uses sampling may not be interested in 
exact results.

In the proposed approach, the {{Sampler}} creates the old, slow, and accurate 
{{TakmiSampleFixer}} if {{SamplingParams.shouldFixToExact()}} is {{true}}. 
Otherwise the much (much!) faster {{AmortizedSampleFixer}} is used, which only 
takes the sampling ratio into account, assuming the sampled set represents the 
whole set with 100% accuracy.

With these changes, the code above should already use the amortized fixer, as 
it is now the default.
If the old fixer is to be used - for comparison - the code could look as 
follows:

{code}
final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );

FacetsCollector facetsCollector;

if ( isSampled )
{
  // Create SamplingParams which denote fixing to exact
  SamplingParams samplingParams = new SamplingParams();
  samplingParams.setFixToExact(true);

  // Use the custom sampling params while creating the RandomSampler
  facetsCollector = FacetsCollector.create( new SamplingAccumulator(
      new RandomSampler(samplingParams, new Random(someSeed)),
      facetSearchParams, searcher.getIndexReader(), taxo ) );
}
else
{
  facetsCollector = FacetsCollector.create( FacetsAccumulator.create(
      facetSearchParams, searcher.getIndexReader(), taxo ) );
}
{code}

The sampling tests still use the exact fixer, as it is not easy to assert 
against amortized results. I'm still looking into creating a complete faceted 
search flow test with the amortized fixer.

 Unexpected performance difference between SamplingAccumulator and 
 StandardFacetAccumulator
 --

 Key: LUCENE-5015
 URL: https://issues.apache.org/jira/browse/LUCENE-5015
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.3
Reporter: Rob Audenaerde
Priority: Minor
 Attachments: LUCENE-5015.patch


 I have an unexpected performance difference between the SamplingAccumulator 
 and the StandardFacetAccumulator. 
 The case is an index with about 5M documents and each document containing 
 about 10 fields. I created a facet on each of those fields. When searching to 
 retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is 
 about twice as fast as the StandardFacetAccumulator. This is expected and a 
 nice speed-up. 
 However, when I use more CountFacetRequests to retrieve facet-counts for more 
 than one field, the speed of the SamplingAccumulator decreases, to the point 
 where the StandardFacetAccumulator is faster. 
 {noformat} 
 FacetRequests   Sampling   Standard
  1   391 ms 1100 ms
  2   531 ms 1095 ms 
  3   948 ms 1108 ms
  4  1400 ms 1110 ms
  5  1901 ms 1102 ms
 {noformat} 
 Is this behaviour normal? I did not expect it, as the SamplingAccumulator 
 needs to do less work? 
 Some code to show what I do:
 {code}
   searcher.search( facetsQuery, facetsCollector );
   final List<FacetResult> collectedFacets = facetsCollector.getFacetResults();
 {code}
 {code}
 final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );
 FacetsCollector facetsCollector;
 if ( isSampled )
 {
   facetsCollector = FacetsCollector.create( new SamplingAccumulator(
       new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 else
 {
   facetsCollector = FacetsCollector.create( FacetsAccumulator.create(
       facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 {code}
   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5015) Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator

2013-05-23 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5015:
-

Attachment: LUCENE-5015.patch

Older patch was against trunk/lucene/facet. This one is rooted at trunk. 


 Unexpected performance difference between SamplingAccumulator and 
 StandardFacetAccumulator
 --

 Key: LUCENE-5015
 URL: https://issues.apache.org/jira/browse/LUCENE-5015
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.3
Reporter: Rob Audenaerde
Priority: Minor
 Attachments: LUCENE-5015.patch, LUCENE-5015.patch


 I have an unexpected performance difference between the SamplingAccumulator 
 and the StandardFacetAccumulator. 
 The case is an index with about 5M documents and each document containing 
 about 10 fields. I created a facet on each of those fields. When searching to 
 retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is 
 about twice as fast as the StandardFacetAccumulator. This is expected and a 
 nice speed-up. 
 However, when I use more CountFacetRequests to retrieve facet-counts for more 
 than one field, the speed of the SamplingAccumulator decreases, to the point 
 where the StandardFacetAccumulator is faster. 
 {noformat} 
 FacetRequests   Sampling   Standard
  1   391 ms 1100 ms
  2   531 ms 1095 ms 
  3   948 ms 1108 ms
  4  1400 ms 1110 ms
  5  1901 ms 1102 ms
 {noformat} 
 Is this behaviour normal? I did not expect it, as the SamplingAccumulator 
 needs to do less work? 
 Some code to show what I do:
 {code}
   searcher.search( facetsQuery, facetsCollector );
   final List<FacetResult> collectedFacets = facetsCollector.getFacetResults();
 {code}
 {code}
 final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );
 FacetsCollector facetsCollector;
 if ( isSampled )
 {
   facetsCollector = FacetsCollector.create( new SamplingAccumulator(
       new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 else
 {
   facetsCollector = FacetsCollector.create( FacetsAccumulator.create(
       facetSearchParams, searcher.getIndexReader(), taxo ) );
 }
 {code}
   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2013-02-07 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4609:
-

Attachment: SemiPackedEncoder.patch

Finally figured out I was doing things completely wrong.. instead of having 
super-smart optimizing code for a semi-packed encoder, there's now a 
straightforward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself), and larger values 
are encoded as a VInt plus a leading zero byte. Worst case it can be 6 bytes per 
value (zero marker + 5 of VInt).

The idea is to pay the variable-length-encoding penalty only for large values, 
which should be less common in a sort-uniq-dgap scenario.
Wrote two versions, with and without dgap specialization, though I'm not sure 
how useful the non-specialized code is.

I do not currently have the means to run the LuceneUtil (nor the wikipedia 
index with the categories) - but I ran the {{EncodingSpeed}} test - and was 
surprised.
While the encoding is a little worse than (or on par with) dgap-vint, the 
decoding speed is significantly faster. The new encoder is the only (?!) 
encoder to beat {{SimpleIntEncoder}} (which writes plain 32 bits per value) in 
decoding time.

The values used in EncodingSpeed come from a real scenario, but I'm not 
sure how well they represent a common case (e.g wikipedia). 

Mike - could you please try this encoder? I guess it only makes sense to run 
the specialized {{DGapSemiPackedEncoder}}.
Also, I'm not sure {{SimpleIntEncoder}} was ever used (without any sorting, or 
unique). It would be interesting to test it as well. We would pay in more I/O 
and a much larger file size (~4 times larger..) but it doesn't mean it will be 
any slower.

Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.

Encoder                                            Bits/Int   Encode [ms]   Encode [usec/int]   Decode [ms]   Decode [usec/int]
--------------------------------------------------------------------------------------------------------------------------------
Simple                                              32.0000           190              1.9000           165             1.6500
VInt8                                               18.4955           436              4.3600           359             3.5900
Sorting(Unique(VInt8))                              18.4955          3557             35.5702           314             3.1400
Sorting(Unique(DGap(VInt8)))                         8.5597          3485             34.8502           270             2.7000
Sorting(Unique(DGapVInt8))                           8.5597          3434             34.3402           192             1.9200
Sorting(Unique(DGap(SemiPacked)))                    8.6453          3386             33.8602           156             1.5600
Sorting(Unique(DGapSemiPacked))                      8.6453          3397             33.9702            99             0.9900
Sorting(Unique(DGap(EightFlags(VInt))))              4.9679          4002             40.0203           381             3.8100
Sorting(Unique(DGap(FourFlags(VInt))))               4.8198          3972             39.7203           399             3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))    4.5794          4448             44.4803           645             6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))    4.5794          4461             44.6103           641             6.4100


Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                            Bits/Int   Encode [ms]   Encode [usec/int]   Decode [ms]   Decode [usec/int]

[jira] [Comment Edited] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2013-02-07 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13573590#comment-13573590
 ] 

Gilad Barkai edited comment on LUCENE-4609 at 2/7/13 3:29 PM:
--

Finally figured out I was doing things completely wrong.. instead of having 
super-smart optimizing code for a semi-packed encoder, there's now a 
straightforward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself), and larger values 
are encoded as a VInt plus a leading zero byte. Worst case it can be 6 bytes per 
value (zero marker + 5 of VInt).

The idea is to pay the variable-length-encoding penalty only for large values, 
which should be less common in a sort-uniq-dgap scenario.

Wrote two versions, with and without dgap specialization, though I'm not sure 
how useful the non-specialized code is.

I do not currently have the means to run the LuceneUtil (nor the wikipedia 
index with the categories) - but I ran the {{EncodingSpeed}} test - and was 
surprised.
While the encoding is a little worse than (or on par with) dgap-vint, the 
decoding speed is significantly faster. The new encoder is the only (?!) 
encoder to beat {{SimpleIntEncoder}} (which writes plain 32 bits per value) in 
decoding time.

The values used in EncodingSpeed come from a real scenario, but I'm not 
sure how well they represent a common case (e.g wikipedia). 

Mike - could you please try this encoder? I guess it only makes sense to run 
the specialized {{DGapSemiPackedEncoder}}.
Also, I'm not sure {{SimpleIntEncoder}} was ever used (without any sorting, or 
unique). It would be interesting to test it as well. We would pay in more I/O 
and a much larger file size (~4 times larger..) but it doesn't mean it will be 
any slower.

Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.

Encoder                                            Bits/Int   Encode [ms]   Encode [usec/int]   Decode [ms]   Decode [usec/int]
--------------------------------------------------------------------------------------------------------------------------------
Simple                                              32.0000           190              1.9000           165             1.6500
VInt8                                               18.4955           436              4.3600           359             3.5900
Sorting(Unique(VInt8))                              18.4955          3557             35.5702           314             3.1400
Sorting(Unique(DGap(VInt8)))                         8.5597          3485             34.8502           270             2.7000
Sorting(Unique(DGapVInt8))                           8.5597          3434             34.3402           192             1.9200
Sorting(Unique(DGap(SemiPacked)))                    8.6453          3386             33.8602           156             1.5600
Sorting(Unique(DGapSemiPacked))                      8.6453          3397             33.9702            99             0.9900
Sorting(Unique(DGap(EightFlags(VInt))))              4.9679          4002             40.0203           381             3.8100
Sorting(Unique(DGap(FourFlags(VInt))))               4.8198          3972             39.7203           399             3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))    4.5794          4448             44.4803           645             6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))    4.5794          4461             44.6103           641             6.4100


Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                            Bits/Int   Encode [ms]   Encode [usec/int]   Decode [ms]   Decode [usec/int]

[jira] [Comment Edited] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2013-02-07 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13573590#comment-13573590
 ] 

Gilad Barkai edited comment on LUCENE-4609 at 2/7/13 3:30 PM:
--

Finally figured out I was doing things completely wrong.. instead of having 
super-smart optimizing code for a semi-packed encoder, there's now a 
straightforward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself), and larger values 
are encoded as a VInt plus a leading zero byte. Worst case it can be 6 bytes per 
value (zero marker + 5 of VInt).

The idea is to pay the variable-length-encoding penalty only for large values, 
which should be less common in a sort-uniq-dgap scenario.

Wrote two versions, with and without dgap specialization, though I'm not sure 
how useful the non-specialized code is.

I do not currently have the means to run the LuceneUtil (nor the wikipedia 
index with the categories) - but I ran the {{EncodingSpeed}} test - and was 
surprised.
While the encoding is a little worse than (or on par with) dgap-vint, the 
decoding speed is significantly faster. The new encoder is the only (?!) 
encoder to beat {{SimpleIntEncoder}} (which writes plain 32 bits per value) in 
decoding time.

The values used in EncodingSpeed come from a real scenario, but I'm not 
sure how well they represent a common case (e.g wikipedia). 

Mike - could you please try this encoder? I guess it only makes sense to run 
the specialized {{DGapSemiPackedEncoder}}.
Also, I'm not sure {{SimpleIntEncoder}} was ever used (without any sorting, or 
unique). It would be interesting to test it as well. We would pay in more I/O 
and a much larger file size (~4 times larger..) but it doesn't mean it will be 
any slower.

Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.

Encoder                                            Bits/Int   Encode [ms]   Encode [usec/int]   Decode [ms]   Decode [usec/int]
--------------------------------------------------------------------------------------------------------------------------------
Simple                                              32.0000           190              1.9000           165             1.6500
VInt8                                               18.4955           436              4.3600           359             3.5900
Sorting(Unique(VInt8))                              18.4955          3557             35.5702           314             3.1400
Sorting(Unique(DGap(VInt8)))                         8.5597          3485             34.8502           270             2.7000
Sorting(Unique(DGapVInt8))                           8.5597          3434             34.3402           192             1.9200
Sorting(Unique(DGap(SemiPacked)))                    8.6453          3386             33.8602           156             1.5600
Sorting(Unique(DGapSemiPacked))                      8.6453          3397             33.9702            99             0.9900
Sorting(Unique(DGap(EightFlags(VInt))))              4.9679          4002             40.0203           381             3.8100
Sorting(Unique(DGap(FourFlags(VInt))))               4.8198          3972             39.7203           399             3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))    4.5794          4448             44.4803           645             6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))    4.5794          4461             44.6103           641             6.4100


Estimating ~1 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                            Bits/Int   Encode [ms]   Encode [usec/int]   Decode [ms]   Decode [usec/int]

[jira] [Commented] (LUCENE-4748) Add DrillSideways helper class to Lucene facets module

2013-02-03 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569780#comment-13569780
 ] 

Gilad Barkai commented on LUCENE-4748:
--

Great idea!

Since drill-down followed by drill-sideways is a sort of (re)filtering over the 
original result set, perhaps the query result (say ScoredDocIds) could be 
passed through rather than re-evaluating the Query? 
IIRC the scores should not change during drill-down (and sideways as well), so 
skipping the re-evaluation could perhaps save some juice?

 Add DrillSideways helper class to Lucene facets module
 --

 Key: LUCENE-4748
 URL: https://issues.apache.org/jira/browse/LUCENE-4748
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4748.patch, LUCENE-4748.patch


 This came out of a discussion on the java-user list with subject
 Faceted search in OR: http://markmail.org/thread/jmnq6z2x7ayzci5k
 The basic idea is to count near misses during collection, ie
 documents that matched the main query and also all except one of the
 drill down filters.
 Drill sideways makes for a very nice faceted search UI because you
 don't lose the facet counts after drilling in.  Eg maybe you do a
 search for cameras, and you see facets for the manufacturer, so you
 drill into Nikon.
 With drill sideways, even after drilling down, you'll still get the
 counts for all the other brands, where each count tells you how many
 hits you'd get if you changed to a different manufacturer.
 This becomes more fun if you add further drill-downs, eg maybe I next drill
 down into Resolution=10 megapixels, and then I can see how many 10
 megapixel cameras all other manufacturers, and what other resolutions
 Nikon cameras offer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4715) Add OrdinalPolicy.NO_DIMENSION

2013-01-29 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565359#comment-13565359
 ] 

Gilad Barkai commented on LUCENE-4715:
--

Looking at the patch, I think I might have misunderstood something - in the 
build method, the right policy is checked for every category, but the build 
itself is per CategoryListParams - so why can't the policy be the same for each 
CLP? If one wishes to get different policies etc., I think it would be logical 
to separate them into different CLPs, so that this check need not be performed 
over each category?



 Add OrdinalPolicy.NO_DIMENSION
 --

 Key: LUCENE-4715
 URL: https://issues.apache.org/jira/browse/LUCENE-4715
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4715.patch


 With the move of OrdinalPolicy to CategoryListParams, 
 NonTopLevelOrdinalPolicy was nuked. It might be good to restore it, as 
 another enum value of OrdinalPolicy.
 It's the same as ALL_PARENTS, only it doesn't add the dimension ordinal, which 
 could save space as well as computation time. It's good for when you don't 
 care about the count of Date/, but only about its children's counts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4715) Add OrdinalPolicy.ALL_BUT_DIMENSION

2013-01-29 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13565570#comment-13565570
 ] 

Gilad Barkai commented on LUCENE-4715:
--

How can a mess be avoided when allowing different OrdinalPolicies in the same 
CLP?
There would be ordinals which have their parents, and ordinals that don't. How 
can the collector or aggregator know which ordinals should be dealt with as 
having no parents and which should not?



 Add OrdinalPolicy.ALL_BUT_DIMENSION
 ---

 Key: LUCENE-4715
 URL: https://issues.apache.org/jira/browse/LUCENE-4715
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4715.patch


 With the move of OrdinalPolicy to CategoryListParams, 
 NonTopLevelOrdinalPolicy was nuked. It might be good to restore it, as 
 another enum value of OrdinalPolicy.
 It's the same as ALL_PARENTS, only it doesn't add the dimension ordinal, which 
 could save space as well as computation time. It's good for when you don't 
 care about the count of Date/, but only about its children's counts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4659) Cleanup CategoryPath

2013-01-06 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545363#comment-13545363
 ] 

Gilad Barkai commented on LUCENE-4659:
--

Patch looks really good.

I'm not concerned about the new objects for subPath. Actually, since the 
{{HashMap}}s are now keyed by {{CategoryPath}} and it's not constantly being 
translated to/from {{String}}, I'm looking forward to better performance than 
before.

Nice job, and a nasty one it must have been.. 
Hats off to him!

A few (minor) comments:
* {{copyFullPath()}} - {{numCharsCopied}} is redundant? The return value could 
have been {{(idx + component[upto].length() - start)}} 
* {{equals()}} line 143 - perhaps use only one index rather than both {{j}} and 
{{i}}? Also, CPs are more likely to differ at the end than at the start (i.e. 
further away from the root/dimension) - perhaps iterate in reverse (up the 
tree), as sketched below?
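
A sketch of that reverse comparison (hypothetical code, assuming a String[] 
components representation):

{code}
// Compare components leaf-first: two paths usually share the dimension prefix,
// so differences surface sooner when scanning from the end.
static boolean componentsEqual(String[] a, String[] b) {
  if (a.length != b.length) return false;
  for (int i = a.length - 1; i >= 0; i--) {
    if (!a[i].equals(b[i])) return false;
  }
  return true;
}
{code}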

 Cleanup CategoryPath
 

 Key: LUCENE-4659
 URL: https://issues.apache.org/jira/browse/LUCENE-4659
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4659.patch, LUCENE-4659.patch


 CategoryPath is supposed to be a simple object which holds a category path's 
 components, and offers some utility methods that can be used during indexing 
 and search.
 Currently, it exposes lots of methods which aren't used, unless by tests - I 
 want to get rid of them. Also, the internal implementation manages 3 char[] 
 for holding the path components, while I think it would have been simpler if 
 it maintained a String[]. I'd like to explore that option too (the input is 
 anyway String, so why copy char[]?).
 Ultimately, I'd like CategoryPath to be immutable. I was able to get rid of most 
 of the mutable methods. The ones that remain will probably go away when I 
 move from char[] to String[]. Immutability is important because in various 
 places in the code we convert a CategoryPath back and forth to String, with 
 TODOs to stop doing that if CP was immutable.
 Will attach a patch that covers the first step - get rid of unneeded methods 
 and beginning to make it immutable.
 Perhaps this can be done in multiple commits?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4659) Cleanup CategoryPath

2013-01-06 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13545370#comment-13545370
 ] 

Gilad Barkai commented on LUCENE-4659:
--

Magnificent, +1 for commit.

 Cleanup CategoryPath
 

 Key: LUCENE-4659
 URL: https://issues.apache.org/jira/browse/LUCENE-4659
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4659.patch, LUCENE-4659.patch, LUCENE-4659.patch


 CategoryPath is supposed to be a simple object which holds a category path's 
 components, and offers some utility methods that can be used during indexing 
 and search.
 Currently, it exposes lots of methods which aren't used, unless by tests - I 
 want to get rid of them. Also, the internal implementation manages 3 char[] 
 for holding the path components, while I think it would have been simpler if 
 it maintained a String[]. I'd like to explore that option too (the input is 
 anyway String, so why copy char[]?).
 Ultimately, I'd like CategoryPath to be immutable. I was able to get rid of most 
 of the mutable methods. The ones that remain will probably go away when I 
 move from char[] to String[]. Immutability is important because in various 
 places in the code we convert a CategoryPath back and forth to String, with 
 TODOs to stop doing that if CP was immutable.
 Will attach a patch that covers the first step - get rid of unneeded methods 
 and beginning to make it immutable.
 Perhaps this can be done in multiple commits?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4659) Cleanup CategoryPath

2013-01-04 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544103#comment-13544103
 ] 

Gilad Barkai commented on LUCENE-4659:
--

I think I wished for CP to become immutable as one of my birthday wishes (while 
blowing out the candles).. these do come true!
That's a very promising start, way to go!

CP's char[] internals were for performance and reusability; I thought they were 
used internally in the CL2O cache - but now I see that they are not? Hmm..
Other methods which could be killed, and are in the way of immutability: 
{{.setFromSerializable()}} and {{.add()}}, which are only used in tests.

Then, the only method which breaks immutability is {{.trim()}}, and it could be 
managed with a C'tor. 
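Something like a copy constructor could take its place (a sketch, assuming the 
proposed String[] components; names are illustrative):

{code}
// An immutable alternative to trim(): build a shorter CategoryPath from an
// existing one instead of mutating it in place.
public CategoryPath(CategoryPath other, int length) {
  this.components = new String[length];
  System.arraycopy(other.components, 0, this.components, 0, length);
}
{code}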

 Cleanup CategoryPath
 

 Key: LUCENE-4659
 URL: https://issues.apache.org/jira/browse/LUCENE-4659
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4659.patch


 CategoryPath is supposed to be a simple object which holds a category path's 
 components, and offers some utility methods that can be used during indexing 
 and search.
 Currently, it exposes lots of methods which aren't used, except by tests - I 
 want to get rid of them. Also, the internal implementation manages 3 char[] 
 for holding the path components, while I think it would have been simpler if 
 it maintained a String[]. I'd like to explore that option too (the input is 
 String anyway, so why copy to char[]?).
 Ultimately, I'd like CategoryPath to be immutable. I was able to get rid of 
 most of the mutable methods. The ones that remain will probably go away when I 
 move from char[] to String[]. Immutability is important because in various 
 places in the code we convert a CategoryPath back and forth to String, with 
 TODOs to stop doing that if CP was immutable.
 Will attach a patch that covers the first step - getting rid of unneeded 
 methods and beginning to make it immutable.
 Perhaps this can be done in multiple commits?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2012-12-21 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13538193#comment-13538193
 ] 

Gilad Barkai commented on LUCENE-4609:
--

Thank you Adrian!
I'll look into it.

 Write a PackedIntsEncoder/Decoder for facets
 

 Key: LUCENE-4609
 URL: https://issues.apache.org/jira/browse/LUCENE-4609
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4609.patch


 Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
 category ordinals. We have several such encoders, including VInt (default), 
 and block encoders.
 It would be interesting to implement and benchmark a 
 PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
 bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
 the max value you can see and (2) one that decides for each doc on the 
 optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2012-12-19 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4609:
-

Attachment: LUCENE-4609.patch

Attached a PackedEncoder, which is based on {{PackedInts}}. Currently only the 
'per-document' bits-per-value approach is implemented.

I'm not convinced the header could be spared, as at the very least the number 
of bits to neglect at the end of the stream should be written. E.g., if there 
are 2 bits per value and there are 17 values, 34 bits are needed, but 
everything is written in whole bytes (here 5 bytes = 40 bits), so 6 bits should 
be neglected.

Updated EncodingTest and EncodingSpeed, and found out that the compression 
factor is not that good, probably due to large values which bump the number of 
required bits per value higher.

Started to look into a semi-packed encoder, which could encode most values in a 
packed manner, but could also add large values as, e.g., vints.
Example: for 6 bits per value, all values 0-62 are packed, while a packed value 
of 63 (all 1's) is a marker that the next value is written in a non-packed 
manner (say vint, Elias delta, whole 32 bits..).
This should improve the compression factor when most ints are small and only a 
few are large.
Impact on encoding/decoding speed remains to be seen..
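To make the idea concrete, here is a minimal sketch of the escape-marker 
scheme, using the 'whole 32 bits' fallback since it keeps the bit stream simple 
(hypothetical code, not the attached patch; assumes bitsPerValue < 32 and a 
caller-sized {{long[]}} buffer):

{code}
// Hypothetical sketch of the semi-packed idea: values below the marker are
// packed at bitsPerValue bits; the all-1's marker means the real value
// follows as a whole 32-bit int.
long writeBits(long[] words, long bitPos, int value, int bits) {
  for (int i = bits - 1; i >= 0; i--) { // most-significant bit first
    if (((value >>> i) & 1) != 0) {
      words[(int) (bitPos >>> 6)] |= 1L << (63 - (bitPos & 63));
    }
    bitPos++;
  }
  return bitPos;
}

long encode(int[] values, int bitsPerValue, long[] words) {
  final int marker = (1 << bitsPerValue) - 1; // e.g. 63 for 6 bits
  long bitPos = 0;
  for (int v : values) {
    if (v < marker) {
      bitPos = writeBits(words, bitPos, v, bitsPerValue);
    } else {
      bitPos = writeBits(words, bitPos, marker, bitsPerValue); // escape
      bitPos = writeBits(words, bitPos, v, 32);                // raw value
    }
  }
  return bitPos; // bits written; round up to bytes and record the leftover
}
{code}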

 Write a PackedIntsEncoder/Decoder for facets
 

 Key: LUCENE-4609
 URL: https://issues.apache.org/jira/browse/LUCENE-4609
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4609.patch


 Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
 category ordinals. We have several such encoders, including VInt (default), 
 and block encoders.
 It would be interesting to implement and benchmark a 
 PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
 bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
 the max value you can see and (2) one that decides for each doc on the 
 optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2012-12-19 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13536347#comment-13536347
 ] 

Gilad Barkai commented on LUCENE-4609:
--

bq. Do you encode the gaps or the straight up ords?

Well, it's an 'end point' encoder, meaning it encodes whatever values it 
receives directly to the output.
One could create an encoder as {{new SortingIntEncoder(new 
UniqueValuesIntEncoder(new DGapIntEncoder(new PackedEncoder())))}}, so the 
values the packed encoder receives are already sorted, unique and gap-encoded. 

{quote}
This is PForDelta compression (the outliers are encoded separately) I think? We 
can test it and see if it helps ... but we weren't so happy with it for 
encoding postings (it adds complexity, slows down decode, and didn't seem to 
help that much in reducing the size).
{quote}

PForDelta is indeed slower. But we've met scenarios in which most dgaps are 
small - hence the NOnes and the Four/Eight Flags encoders. If indeed most 
values are small - say, they fit in 4 bits - but there are also one or two 
larger values which would require 12 or 14 bits, we could benefit here greatly.
This is all relevant only where there is a large number of categories per 
document.

bq. it seems like you are writing the full header per field
That is right. To be frank, I'm not 100% sure what {{PackedInts}} does, nor 
how large its header is..
But I think perhaps some header per doc is required anyway? For a bits-per-value 
smaller than the size of a byte, there's a need to know how many bits should be 
left out of the last read byte. 

I started writing my own version as a first step toward the 'mixed' version, in 
which a 1-byte header is written that contains both the 'bits per value' in the 
first 5 bits and the number of extra bits in the last 3 bits. I'm still 
playing with it, hope to share it soon.
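One possible layout for that header byte (a sketch; storing bitsPerValue - 1 
lets the 5 bits cover the full 1..32 range):

{code}
// Sketch of a 1-byte header: 5 bits hold (bitsPerValue - 1), 3 bits hold the
// number of pad bits to ignore at the end of the last byte.
byte writeHeader(int bitsPerValue, int padBits) { // 1 <= bpv <= 32, 0 <= pad <= 7
  return (byte) (((bitsPerValue - 1) << 3) | padBits);
}
int bitsPerValue(byte header) { return ((header >> 3) & 0x1F) + 1; }
int padBits(byte header)      { return header & 0x07; }
{code}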

 Write a PackedIntsEncoder/Decoder for facets
 

 Key: LUCENE-4609
 URL: https://issues.apache.org/jira/browse/LUCENE-4609
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4609.patch


 Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
 category ordinals. We have several such encoders, including VInt (default), 
 and block encoders.
 It would be interesting to implement and benchmark a 
 PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
 bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
 the max value you can see and (2) one that decides for each doc on the 
 optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2012-12-19 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13536384#comment-13536384
 ] 

Gilad Barkai commented on LUCENE-4609:
--

bq. Hopefully you don't need to separately encode leftover unused bits ... ie 
byte[].length (which is free here, since codec already stores this) should 
suffice.

I'm missing something.. if there are 2 bits per value, and the codec knows it's 
only 1 byte, there could be either 1, 2, 3 or 4 values in that single byte. How 
could the decoder know when to stop, without knowing how many bits at the end 
should be ignored? 

 Write a PackedIntsEncoder/Decoder for facets
 

 Key: LUCENE-4609
 URL: https://issues.apache.org/jira/browse/LUCENE-4609
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4609.patch


 Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
 category ordinals. We have several such encoders, including VInt (default), 
 and block encoders.
 It would be interesting to implement and benchmark a 
 PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
 bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
 the max value you can see and (2) one that decides for each doc on the 
 optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

2012-12-19 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13536391#comment-13536391
 ] 

Gilad Barkai commented on LUCENE-4609:
--

In a unique dgap encoding there's no zero gap, so in order to save that little 
extra bit, every gap could be encoded as (gap - 1). Tricky is the word :)
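In code, roughly (a sketch; assumes the ordinals are unique, sorted and >= 1, 
as after the sorting/unique/dgap chain):

{code}
// The (gap - 1) trick: with sorted unique ordinals >= 1 every gap is >= 1,
// so storing gap - 1 reclaims the zero and shaves a bit off the packing.
int prev = 0;
for (int i = 0; i < ords.length; i++) { // encode
  gaps[i] = ords[i] - prev - 1;
  prev = ords[i];
}
prev = 0;
for (int i = 0; i < gaps.length; i++) { // decode
  prev = prev + gaps[i] + 1;
  ords[i] = prev;
}
{code}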

 Write a PackedIntsEncoder/Decoder for facets
 

 Key: LUCENE-4609
 URL: https://issues.apache.org/jira/browse/LUCENE-4609
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4609.patch


 Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
 category ordinals. We have several such encoders, including VInt (default), 
 and block encoders.
 It would be interesting to implement and benchmark a 
 PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
 bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
 the max value you can see and (2) one that decides for each doc on the 
 optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4633) DirectoryTaxonomyWriter.replaceTaxonomy should refresh the reader

2012-12-16 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13533324#comment-13533324
 ] 

Gilad Barkai commented on LUCENE-4633:
--

+1 patch looks good.

 DirectoryTaxonomyWriter.replaceTaxonomy should refresh the reader
 -

 Key: LUCENE-4633
 URL: https://issues.apache.org/jira/browse/LUCENE-4633
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.0
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4633.patch


 While migrating code to Lucene 4.0 I tripped over it. If you call 
 replaceTaxonomy() with e.g. a taxonomy index that contains category "a", and 
 then you try to add category "a" to the new taxonomy, it receives a new 
 ordinal!
 The reason is that replaceTaxonomy doesn't refresh the internal IndexReader, 
 but does clear the cache (as it should). This causes the next addCategory() to 
 not find category "a" in the cache, nor in the reader instance at hand.
 Simple fix, I'll attach a patch with it and a test exposing the bug.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4622) TopKFacetsResultHandler should tie break sort by label not ord?

2012-12-12 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13529988#comment-13529988
 ] 

Gilad Barkai commented on LUCENE-4622:
--

I'm not fond of the post-processing...
Today the sort is consistent. Lucene breaks ties on doc ids, whose order may 
not be consistent due to out-of-order merges. This is not the case with 
category ordinals.

If one wishes to post-process, they should be able to do so quite easily? But 
as pointed out, it might not produce the intended results, due to the many 
categories which scored the same and were left out.

 TopKFacetsResultHandler should tie break sort by label not ord?
 ---

 Key: LUCENE-4622
 URL: https://issues.apache.org/jira/browse/LUCENE-4622
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Reporter: Michael McCandless

 EG I now get these facets:
 {noformat}
 Author (5)
  Lisa (2)
  Frank (1)
  Susan (1)
  Bob (1)
 {noformat}
 The primary sort is by count, but secondary is by ord (= order in which they 
 were indexed), which is not really understandable/transparent to the end 
 user.  I think it'd be best if we could do tie-break sort by label ...
 But talking to Shai, this seems hard/costly to fix, because when visiting the 
 facet ords to collect the top K, we don't currently resolve to label, and in 
 the worst case (say my example had a million labels with count 1) that's a 
 lot of extra label lookups ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4619) Create a specialized path for facets counting

2012-12-12 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13530307#comment-13530307
 ] 

Gilad Barkai commented on LUCENE-4619:
--

Throwing in a crazy idea.. can the FacetIndexingParams be part of the 
IndexWriterConfig?

 Create a specialized path for facets counting
 -

 Key: LUCENE-4619
 URL: https://issues.apache.org/jira/browse/LUCENE-4619
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
 Attachments: LUCENE-4619.patch


 Mike and I have been discussing that on several issues (LUCENE-4600, 
 LUCENE-4602) and on GTalk ... it looks like the current API abstractions may 
 be responsible for some of the performance loss that we see, compared to 
 specialized code.
 During our discussion, we've decided to target a specific use case - facets 
 counting and work on it, top-to-bottom by reusing as much code as possible. 
 Specifically, we'd like to implement a FacetsCollector/Accumulator which can 
 do only counting (i.e. respects only CountFacetRequest), no sampling, 
 partitions and complements. The API allows us to do so very cleanly, and in 
 the context of that issue, we'd like to do the following:
 * Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams 
 and CategoryPath (List, Iterable, whatever) and adds the needed information 
 to both the taxonomy index as well as the search index.
 ** That API is similar in nature to CategoryDocumentBuilder, only easier to 
 consume -- it's just another field that you add to the Document.
 ** We'll have two extensions for it: PayloadFacetsField and 
 DocValuesFacetsField, so that we can benchmark the two approaches. 
 Eventually, one of them we believe, will be eliminated, and we'll remain w/ 
 just one (hopefully the DV one).
 * Implement either a FacetsAccumulator/Collector which takes a bunch of 
 CountFacetRequests and returns the top-counts.
 ** Aggregations are done in-collection, rather than post. Note that we have 
 LUCENE-4600 open for exploring that. Either we finish this exploration here, 
 or do it there. Just FYI that the issue exists.
 ** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open 
 a separate issue to explore improving that API to be bulk, and then we can 
 decide if this specialized Collector should use those abstractions, or be 
 really optimized for the facet counting case.
 * At the moment, this path will assume that a document holds multiple 
 dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike 
 for a document), and therefore use OrdPolicy.NO_PARENTS.
 ** Later, we'd like to explore how to have this specialized path handle the 
 ALL_PARENTS case too, as it shouldn't be so hard to do.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results

2012-12-11 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4461:
-

Attachment: LUCENE-4461.patch

Proposed fix - in {{StandardFacetsAccumulator}}, guard against handling and 
merging the same request more than once.
Also a matching test is introduced, inspired by the previous patch.

Thanks Rodrigo!
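
In essence, the guard looks like this (a sketch; the accessor name is an 
assumption, the real change is in the attached patch):

{code}
// Handle each distinct FacetRequest once, relying on its equals()/hashCode();
// exact duplicates are skipped instead of being merged twice.
Set<FacetRequest> handled = new HashSet<FacetRequest>();
for (FacetRequest fr : searchParams.getFacetRequests()) {
  if (!handled.add(fr)) {
    continue; // same request already handled and merged
  }
  // ... heap a top-k per partition and merge the results for fr ...
}
{code}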

 Multiple FacetRequest with the same path creates inconsistent results
 -

 Key: LUCENE-4461
 URL: https://issues.apache.org/jira/browse/LUCENE-4461
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6
Reporter: Rodrigo Vega
  Labels: facet, faceted-search
 Attachments: LUCENE-4461.patch, LuceneFacetTest.java


 Multiple FacetRequests are getting merged into one, creating wrong results in 
 this case:
 FacetSearchParams facetSearchParams = new FacetSearchParams();
 facetSearchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 10));
 facetSearchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 10));
 The problem can be fixed by defining hashCode and equals in a certain way so 
 that Lucene recognizes we are talking about different requests.
 Attached test case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4610) Implement a NoParentsAccumulator

2012-12-11 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13529716#comment-13529716
 ] 

Gilad Barkai commented on LUCENE-4610:
--

Since taxonomies do not tend to change in depth (i.e., if FileType has only 
depth 1, it is not likely it will suddenly grow to depth 3) - then perhaps the 
definition that these N dimensions are single-valued should be set in a 
{{CategoryListParams}}? 

Shai mentioned yesterday that adding the parents during aggregation is 
problematic when partitions are in place. Parents might not be on the same 
partition as their children, nor is the aggregator aware of the partition it is 
currently aggregating upon. With partitions it seems adding the parents in a 
post process - as was first suggested - is the right approach. 
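
For the non-partitioned case, that post process could be a single backward 
pass over the counting array (a sketch; relies on a parent always having a 
smaller ordinal than its children, which the taxonomy guarantees):

{code}
// Roll exact per-category counts up the parents chain after aggregation
// (no partitions). One backward pass suffices because a parent's ordinal is
// always smaller than its children's.
int[] parents = taxoReader.getParentArray();
for (int ord = counts.length - 1; ord > 0; ord--) {
  counts[parents[ord]] += counts[ord];
}
{code}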

 Implement a NoParentsAccumulator
 

 Key: LUCENE-4610
 URL: https://issues.apache.org/jira/browse/LUCENE-4610
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Shai Erera

 Mike experimented with encoding just the exact categories ordinals on 
 LUCENE-4602, and I added OrdinalPolicy.NO_PARENTS, with a comment saying that 
 this requires a special FacetsAccumulator.
 The idea is to write the exact categories only for each document, and then at 
 search time count up the parents chain to compute requested facets (I say 
 count, but it can be any weight).
 One limitation of such accumulator is that it cannot be used when e.g. a 
 document is associated with two categories who share the same parent, because 
 that may result in incorrect weights computed (e.g. a document might have 
 several Authors, and so counting the Author facet may yield wrong counts). So 
 it can be used only when the app knows it doesn't add such facets, or that it 
 always asks to aggregate a 'root' that in its path this criteria doesn't hold 
 (no categories share the same parent).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4600) Facets should aggregate during collection, not at the end

2012-12-07 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13527077#comment-13527077
 ] 

Gilad Barkai commented on LUCENE-4600:
--

Aggregating all doc ids first also makes it easier to compute actual results 
after sampling. 
That is done by taking the top-(c)K of the sampled results and calculating 
their true value over all matching documents, giving both the benefit of 
sampling and results which make sense to the user (e.g., in counting, the end 
number would actually be the number of documents matching this category).

As for aggregating 'on the fly', it has some other issues:
* It (was?) believed that accessing the counting array during query execution 
may lead to memory cache issues. The entire counting array could be accessed 
for every document over and over, and it's not guaranteed it would fit into the 
cache (that's the CPU's one). That might not be a problem on modern hardware.
* While the OS can cache all the payload data itself, that gets difficult as 
the index grows. If the OS fails to cache the file, it is (again, was?) 
believed that going over the file sequentially once, without seeks (at least by 
the current thread), would be faster.

It's sort of becoming a religion with all those beliefs, as some scenarios 
used to make sense a few years ago. I'm not sure they still do. 
Can't wait to see how some of these co-exist with the benchmark results.
If all religions could have been benchmarked... ;)



 Facets should aggregate during collection, not at the end
 -

 Key: LUCENE-4600
 URL: https://issues.apache.org/jira/browse/LUCENE-4600
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless

 Today the facet module simply gathers all hits (as a bitset, optionally with 
 a float[] to hold scores as well, if you will aggregate them) during 
 collection, and then at the end when you call getFacetsResults(), it makes a 
 2nd pass over all those hits doing the actual aggregation.
 We should investigate just aggregating as we collect instead, so we don't 
 have to tie up transient RAM (fairly small for the bit set but possibly big 
 for the float[]).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4565) Simplify TaxoReader ParentArray/ChildrenArrays

2012-12-06 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13511315#comment-13511315
 ] 

Gilad Barkai commented on LUCENE-4565:
--

Patch's OK; some comments:

* The notion of youngestChild and olderSibling is important. I would not have 
removed the 'older/youngest' parts, especially not from the TaxoArray's API, 
but would rather keep them everywhere.

* In DirectoryTaxonomyWriter, the loop over the term enums had been changed to 
reuse the TermsEnum and DocsEnum - only they are not reused. I got confused 
trying to find the reusability. Is there an internal Java expert optimization 
for such cases?   



 Simplify TaxoReader ParentArray/ChildrenArrays
 --

 Key: LUCENE-4565
 URL: https://issues.apache.org/jira/browse/LUCENE-4565
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Attachments: LUCENE-4565.patch


 TaxoReader exposes two structures which provide information about a 
 categories parent/childs/siblings: ParentArray and ChildrenArrays. 
 ChildrenArrays are derived (i.e. created) from ParentArray.
 I propose to consolidate all that into one API ParentInfo, or 
 CategoryTreeInfo (a better name?) which will provide the same information, 
 only from one object. So instead of making these calls:
 {code}
 int[] parents = taxoReader.getParentArray();
 int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray();
 int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray();
 {code}
 one would make these calls:
 {code}
 int[] parents = taxoReader.getParentInfo().parents();
 int[] youngestChilds = taxoReader.getParentInfo().youngestChilds();
 int[] olderSiblings = taxoReader.getParentInfo().olderSiblings();
 {code}
 Not a big change, just consolidate more code into one logical place. All of 
 these arrays will continue to be lazily allocated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4565) Simplify TaxoReader ParentArray/ChildrenArrays

2012-12-06 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13511331#comment-13511331
 ] 

Gilad Barkai commented on LUCENE-4565:
--

bq. I thought that it's over-verbosing to put them in the method names. Rather, 
I think that good javadocs are what's needed here. It's an API that one reads 
one time usually.

{{.children()}} looks like it returns all children; it does not. 
{{.siblings()}} looks like it returns all siblings, which it does not. I think 
good JavaDoc is a blessing, but there's no penalty in making the code document 
itself - which it did until now. I see no reason to change that.

bq. They are reused
My bad! I need coffee...

 Simplify TaxoReader ParentArray/ChildrenArrays
 --

 Key: LUCENE-4565
 URL: https://issues.apache.org/jira/browse/LUCENE-4565
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Attachments: LUCENE-4565.patch


 TaxoReader exposes two structures which provide information about a 
 categories parent/childs/siblings: ParentArray and ChildrenArrays. 
 ChildrenArrays are derived (i.e. created) from ParentArray.
 I propose to consolidate all that into one API ParentInfo, or 
 CategoryTreeInfo (a better name?) which will provide the same information, 
 only from one object. So instead of making these calls:
 {code}
 int[] parents = taxoReader.getParentArray();
 int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray();
 int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray();
 {code}
 one would make these calls:
 {code}
 int[] parents = taxoReader.getParentInfo().parents();
 int[] youngestChilds = taxoReader.getParentInfo().youngestChilds();
 int[] olderSiblings = taxoReader.getParentInfo().olderSiblings();
 {code}
 Not a big change, just consolidate more code into one logical place. All of 
 these arrays will continue to be lazily allocated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4565) Simplify TaxoReader ParentArray/ChildrenArrays

2012-12-05 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13510603#comment-13510603
 ] 

Gilad Barkai commented on LUCENE-4565:
--

FamilyTree?
Genealogy?
How about Stemma?

 Simplify TaxoReader ParentArray/ChildrenArrays
 --

 Key: LUCENE-4565
 URL: https://issues.apache.org/jira/browse/LUCENE-4565
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor

 TaxoReader exposes two structures which provide information about a 
 categories parent/childs/siblings: ParentArray and ChildrenArrays. 
 ChildrenArrays are derived (i.e. created) from ParentArray.
 I propose to consolidate all that into one API ParentInfo, or 
 CategoryTreeInfo (a better name?) which will provide the same information, 
 only from one object. So instead of making these calls:
 {code}
 int[] parents = taxoReader.getParentArray();
 int[] youngestChilds = taxoReader.getChildrenArrays().getYoungestChildArray();
 int[] olderSiblings = taxoReader.getChildrenArrays().getOlderSiblingArray();
 {code}
 one would make these calls:
 {code}
 int[] parents = taxoReader.getParentInfo().parents();
 int[] youngestChilds = taxoReader.getParentInfo().youngestChilds();
 int[] olderSiblings = taxoReader.getParentInfo().olderSiblings();
 {code}
 Not a big change, just consolidate more code into one logical place. All of 
 these arrays will continue to be lazily allocated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a ConstantScoreQuery

2012-12-04 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509628#comment-13509628
 ] 

Gilad Barkai commented on LUCENE-4580:
--

+1

 Facet DrillDown should return a ConstantScoreQuery
 --

 Key: LUCENE-4580
 URL: https://issues.apache.org/jira/browse/LUCENE-4580
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4580.patch, LUCENE-4580.patch


 DrillDown is a helper class which the user can use to convert a facet value 
 that a user selected into a Query for performing drill-down or narrowing the 
 results. The API has several static methods that create e.g. a Term or Query.
 Rather than creating a Query, it would make more sense to create a Filter I 
 think. In most cases, the clicked facets should not affect the scoring of 
 documents. Anyway, even if it turns out that it must return a Query (which I 
 doubt), we should at least modify the impl to return a ConstantScoreQuery.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4586) Change default ResultMode of FacetRequest to PER_NODE_IN_TREE

2012-12-04 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509847#comment-13509847
 ] 

Gilad Barkai commented on LUCENE-4586:
--

Patch looks good, ready for a commit.
+1

 Change default ResultMode of FacetRequest to PER_NODE_IN_TREE
 -

 Key: LUCENE-4586
 URL: https://issues.apache.org/jira/browse/LUCENE-4586
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 4.1, 5.0

 Attachments: LUCENE-4586.patch, LUCENE-4586.patch


 Today the default ResultMode is GLOBAL_FLAT, but it should be 
 PER_NODE_IN_TREE. ResultMode is being used whenever you set the depth of 
 FacetRequest to greater than 1. The difference between the two is:
 * PER_NODE_IN_TREE would then compute the top-K categories recursively, for 
 every top category at every level (up to depth). The results are returned in 
 a tree structure as well. For instance:
 {noformat}
 Date
   2010
 March
 February
   2011
 April
 May
 {noformat}
 * GLOBAL_FLAT computes the top categories among all the nodes up to depth, 
 and returns a flat list of categories.
 GLOBAL_FLAT is faster to compute than PER_NODE_IN_TREE (it just computes 
 top-K among N total categories), however I think that it's less intuitive, 
 and therefore should not be used as a default. In fact, I think this is kind 
 of an expert usage.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4580) Facet DrillDown should return a ConstantScoreQuery

2012-12-03 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509527#comment-13509527
 ] 

Gilad Barkai edited comment on LUCENE-4580 at 12/4/12 5:47 AM:
---

Patch looks good.

Two comments:
* In {{SimpleSearcher}} the code is changed such that {{DrillDown.query()}} 
always takes a {{new FacetSearchParams()}} (no default) - but it is not obvious 
that it is the same one used for the search, or that it contains the 
{{FacetIndexingParams}} used for indexing. The defaults made sure that if none 
was specified along the way, the user did not need to bother with it. It is a 
small piece of code, but in an example it might confuse the reader. I'm 
somewhat leaning toward allowing a default query() call without specifying an 
FSP.
* In the tests, perhaps use Float.MIN_VALUE or 0f as the delta when asserting 
equality of floats, instead of asserting against 0.0.


  was (Author: gilad):
Patch looks good.

Two comments:
* In {{SimpleSearcher}} the code is change as such the {{DrillDown.query()}} 
always takes a {{new FacetSearchParams()}} (no default) - but that's not 
obvious it is the same one used for the search, or that it contains the 
{{FacetIndexingParams}} used for indexing. The defaults made sure that if none 
was specified along the way - the user should not bother himself with it. It is 
a small piece of code, but in an example might confuse the reader. I'm somewhat 
leaning toward allowing a default query() call without specifying a FSP.
* In the tests, perhaps use Float.MIN_VALUE or 0f when asserting equality of 
floats instead against 0.0.

  
 Facet DrillDown should return a ConstantScoreQuery
 --

 Key: LUCENE-4580
 URL: https://issues.apache.org/jira/browse/LUCENE-4580
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4580.patch


 DrillDown is a helper class which the user can use to convert a facet value 
 that a user selected into a Query for performing drill-down or narrowing the 
 results. The API has several static methods that create e.g. a Term or Query.
 Rather than creating a Query, it would make more sense to create a Filter I 
 think. In most cases, the clicked facets should not affect the scoring of 
 documents. Anyway, even if it turns out that it must return a Query (which I 
 doubt), we should at least modify the impl to return a ConstantScoreQuery.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a ConstantScoreQuery

2012-12-03 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13509527#comment-13509527
 ] 

Gilad Barkai commented on LUCENE-4580:
--

Patch looks good.

Two comments:
* In {{SimpleSearcher}} the code is changed such that {{DrillDown.query()}} 
always takes a {{new FacetSearchParams()}} (no default) - but it is not obvious 
that it is the same one used for the search, or that it contains the 
{{FacetIndexingParams}} used for indexing. The defaults made sure that if none 
was specified along the way, the user did not need to bother with it. It is a 
small piece of code, but in an example it might confuse the reader. I'm 
somewhat leaning toward allowing a default query() call without specifying an 
FSP.
* In the tests, perhaps use Float.MIN_VALUE or 0f as the delta when asserting 
equality of floats, instead of asserting against 0.0.
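
For instance, JUnit's three-argument assert makes the delta explicit 
(illustrative names):

{code}
// An explicit delta makes the float comparison deliberate, rather than an
// accidental exact match against 0.0.
assertEquals(expectedScore, topDocs.scoreDocs[0].score, 0f);
{code}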


 Facet DrillDown should return a ConstantScoreQuery
 --

 Key: LUCENE-4580
 URL: https://issues.apache.org/jira/browse/LUCENE-4580
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor
 Attachments: LUCENE-4580.patch


 DrillDown is a helper class which the user can use to convert a facet value 
 that a user selected into a Query for performing drill-down or narrowing the 
 results. The API has several static methods that create e.g. a Term or Query.
 Rather than creating a Query, it would make more sense to create a Filter I 
 think. In most cases, the clicked facets should not affect the scoring of 
 documents. Anyway, even if it turns out that it must return a Query (which I 
 doubt), we should at least modify the impl to return a ConstantScoreQuery.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a Filter not Query

2012-12-02 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13508195#comment-13508195
 ] 

Gilad Barkai commented on LUCENE-4580:
--

{{DrillDown}} is a useful class with a straightforward API, which makes the 
life of basic users simpler.
As Shai pointed out, today there is no dependency on the queries module, but 
the code contains a hidden bug in which a 'drill down' operation may change the 
score of the results. Adding a Filter or a {{ConstantScoreQuery}} looks like 
the right way to go.
That sort of fix is possible, while keeping the usefulness of the DrillDown 
class, only if the code becomes dependent on the queries module.
On the other hand, removing the dependency would force most facet users to 
write that exact extra code as mentioned. Preventing such cases was the reason 
this utility class was created.

'Drilling down' is a basic feature of a faceted search application, and the 
DrillDown class provides an easy way of invoking it.
Having a faceted search application that does not utilize the queries module 
(e.g. filtering) seems remote - is there any such scenario?
Module dependency may result in a user loading jars he does not need or care 
about, but the queries module jar is likely to be found in any faceted search 
application.

Modules should be independent, but I see enough gain here. It would not 
bother me if the facet module depended on the queries module. I find it 
logical.

-1 for forcing users to write the same code over and over just to keep the 
facet module independent of the queries module
+1 for adding {{DrillDown.filter(CategoryPath...)}} - that looks like the way 
to go


 Facet DrillDown should return a Filter not Query
 

 Key: LUCENE-4580
 URL: https://issues.apache.org/jira/browse/LUCENE-4580
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor

 DrillDown is a helper class which the user can use to convert a facet value 
 that a user selected into a Query for performing drill-down or narrowing the 
 results. The API has several static methods that create e.g. a Term or Query.
 Rather than creating a Query, it would make more sense to create a Filter I 
 think. In most cases, the clicked facets should not affect the scoring of 
 documents. Anyway, even if it turns out that it must return a Query (which I 
 doubt), we should at least modify the impl to return a ConstantScoreQuery.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a Filter not Query

2012-12-02 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13508202#comment-13508202
 ] 

Gilad Barkai commented on LUCENE-4580:
--

bq. it should be a combination of TermsFilter and BooleanFilter. So in fact if 
we want to keep DrillDown behave like today, we should use BooleanFilter and 
TermsFilter.

+1
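
Roughly, the combination could look like this (a sketch; using 
{{DrillDown.term()}} as the per-category term source and the exact 
{{BooleanFilter}} signatures are assumptions):

{code}
// AND together one TermsFilter per drilled-down category via the queries
// module's BooleanFilter, keeping scores untouched.
BooleanFilter drillDownFilter = new BooleanFilter();
for (CategoryPath cp : paths) {
  drillDownFilter.add(new TermsFilter(DrillDown.term(searchParams, cp)),
      BooleanClause.Occur.MUST);
}
{code}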

 Facet DrillDown should return a Filter not Query
 

 Key: LUCENE-4580
 URL: https://issues.apache.org/jira/browse/LUCENE-4580
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor

 DrillDown is a helper class which the user can use to convert a facet value 
 that a user selected into a Query for performing drill-down or narrowing the 
 results. The API has several static methods that create e.g. a Term or Query.
 Rather than creating a Query, it would make more sense to create a Filter I 
 think. In most cases, the clicked facets should not affect the scoring of 
 documents. Anyway, even if it turns out that it must return a Query (which I 
 doubt), we should at least modify the impl to return a ConstantScoreQuery.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4580) Facet DrillDown should return a Filter not Query

2012-12-01 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13507982#comment-13507982
 ] 

Gilad Barkai commented on LUCENE-4580:
--

+1

 Facet DrillDown should return a Filter not Query
 

 Key: LUCENE-4580
 URL: https://issues.apache.org/jira/browse/LUCENE-4580
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Shai Erera
Priority: Minor

 DrillDown is a helper class which the user can use to convert a facet value 
 that a user selected into a Query for performing drill-down or narrowing the 
 results. The API has several static methods that create e.g. a Term or Query.
 Rather than creating a Query, it would make more sense to create a Filter I 
 think. In most cases, the clicked facets should not affect the scoring of 
 documents. Anyway, even if it turns out that it must return a Query (which I 
 doubt), we should at least modify the impl to return a ConstantScoreQuery.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3441) Add NRT support to LuceneTaxonomyReader

2012-11-21 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501966#comment-13501966
 ] 

Gilad Barkai commented on LUCENE-3441:
--

Patch looks very good.

+1.

 Add NRT support to LuceneTaxonomyReader
 ---

 Key: LUCENE-3441
 URL: https://issues.apache.org/jira/browse/LUCENE-3441
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Attachments: LUCENE-3441.patch, LUCENE-3441.patch


 Currently LuceneTaxonomyReader does not support NRT - i.e., on changes to 
 LuceneTaxonomyWriter, you cannot have the reader updated, like 
 IndexReader/Writer. In order to do that we need to do the following:
 # Add ctor to LuceneTaxonomyReader to allow you to instantiate it with 
 LuceneTaxonomyWriter.
 # Add API to LuceneTaxonomyWriter to expose its internal IndexReader
 # Change LTR.refresh() to return an LTR, rather than void. This is actually 
 not strictly related to that issue, but since we'll need to modify refresh() 
 impl, I think it'll be good to change its API as well. Since all of facet API 
 is @lucene.experimental, no backwards issues here (and the sooner we do it, 
 the better).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4532) TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy failure

2012-11-06 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13491466#comment-13491466
 ] 

Gilad Barkai commented on LUCENE-4532:
--

Reviewed the patch - and it looks very good.

A few comments:
1. In TestDirectoryTaxonomyWriter.java, the error string _index.create.time 
not found in commitData_ should be updated.
2. If the index creation time is in the commit data, it will not be removed - 
as the epoch is added to whatever commit data was read from the index. I think 
perhaps it should be removed?
3. Since the members related to the old 'timestamp' method are removed, no 
test can check the migration from the old method to the new one. It might be a 
good idea to add one, with a comment to remove it when backward compatibility 
is no longer required (Lucene 6?).
4. 'Epoch' is usually used in the context of time, or in relation to a period. 
Perhaps the name 'version' is more closely related to the implementation?
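
For point 2, a sketch of what dropping the legacy key could look like (key 
names and the surrounding variables are assumptions):

{code}
// Remove the legacy creation-time key when carrying the existing commit data
// forward, so it does not linger next to the new epoch value.
Map<String, String> commitData = new HashMap<String, String>(existingCommitData);
commitData.remove("index.create.time");                   // legacy key
commitData.put("index.epoch", Long.toString(indexEpoch)); // new marker
indexWriter.setCommitData(commitData);
{code}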



 TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy failure
 

 Key: LUCENE-4532
 URL: https://issues.apache.org/jira/browse/LUCENE-4532
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Reporter: Shai Erera
Assignee: Shai Erera
 Attachments: LUCENE-4532.patch


 The following failure on Jenkins:
 {noformat}
  Build: http://jenkins.sd-datasolutions.de/job/Lucene-Solr-4.x-Windows/1404/
  Java: 32bit/jdk1.6.0_37 -client -XX:+UseConcMarkSweepGC
 
  1 tests failed.
  REGRESSION:  
  org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy
 
  Error Message:
 
 
  Stack Trace:
  java.lang.ArrayIndexOutOfBoundsException
  at 
  __randomizedtesting.SeedInfo.seed([6AB10D3E4E956CFA:BFB2863DB7E077E0]:0)
  at java.lang.System.arraycopy(Native Method)
  at 
  org.apache.lucene.facet.taxonomy.directory.ParentArray.refresh(ParentArray.java:99)
  at 
  org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader.refresh(DirectoryTaxonomyReader.java:407)
  at 
  org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.doTestReadRecreatedTaxono(TestDirectoryTaxonomyReader.java:167)
  at 
  org.apache.lucene.facet.taxonomy.directory.TestDirectoryTaxonomyReader.testRefreshReadRecreatedTaxonomy(TestDirectoryTaxonomyReader.java:130)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at 
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at 
  com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559)
  at 
  com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79)
  at 
  com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737)
  at 
  com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:773)
  at 
  com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:787)
  at 
  org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
  at 
  org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51)
  at 
  org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at 
  com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
  at 
  org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at 
  org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
  at 
  org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
  at 
  com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at 
  com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
  at 
  com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782)
  at 
  com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:442)
  at 
  com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:746)
  at 
  com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:648)
  at 
  

[jira] [Commented] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results

2012-10-09 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13472164#comment-13472164
 ] 

Gilad Barkai commented on LUCENE-4461:
--

Nice catch!

Took a while to pinpoint the reason - lines 173-181 of 
StandardFacetsAccumulator.
In the mentioned lines, a 'merge' is performed over categories which matched 
the request but reside on different partitions. 

bq. Partitions are an optimization which limits the RAM requirements per query 
to a constant, rather than linear in the taxonomy size (which could be millions 
of categories). The taxonomy is virtually split into partitions of constant 
size, a top-k is heaped from each partition, and all those top-k results are 
merged into a global top-k list

The proposed solution of changing hashCode and equals, so that the same 
request will have two hashCodes and will not be equal to itself, is very likely 
to break other parts of the code.

Perhaps such cases could be prevented altogether? E.g., throwing an exception 
when the (exact) same request is added twice. 
Is that a reasonable solution? Are there cases where it is necessary to request 
the same path twice? 
Please note that a different count, depth, path etc. makes a different 
request, so requesting author with count 10 and count 11 makes two different 
requests - which current versions already handle correctly side by side. 


 Multiple FacetRequest with the same path creates inconsistent results
 -

 Key: LUCENE-4461
 URL: https://issues.apache.org/jira/browse/LUCENE-4461
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6
Reporter: Rodrigo Vega
  Labels: facet, faceted-search
 Attachments: LuceneFacetTest.java


 Multiple FacetRequests are getting merged into one, creating wrong results in 
 this case:
 FacetSearchParams facetSearchParams = new FacetSearchParams();
 facetSearchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 10));
 facetSearchParams.addFacetRequest(new CountFacetRequest(new CategoryPath("author"), 10));
 The problem can be fixed by defining hashCode and equals in a certain way so 
 that Lucene recognizes we are talking about different requests.
 Attached test case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results

2012-10-09 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13472416#comment-13472416
 ] 

Gilad Barkai commented on LUCENE-4461:
--

The same category can be set as a filter and as a request without them 
colliding - a filter is not correlated with, or dependent on, a facet request. 
Facet filters are applied at the query level, which affects the result set, 
while the FacetRequest defines which categories to retrieve out of the result 
set.
I'm probably missing something here :)



 Multiple FacetRequest with the same path creates inconsistent results
 -

 Key: LUCENE-4461
 URL: https://issues.apache.org/jira/browse/LUCENE-4461
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6
Reporter: Rodrigo Vega
  Labels: facet, faceted-search
 Attachments: LuceneFacetTest.java


 Multiple FacetRequests are getting merged into one, creating wrong results in 
 this case:
 FacetSearchParams facetSearchParams = new FacetSearchParams();
   facetSearchParams.addFacetRequest(new CountFacetRequest(new 
 CategoryPath("author"), 10));
   facetSearchParams.addFacetRequest(new CountFacetRequest(new 
 CategoryPath("author"), 10));
 Problem can be fixed by defining hashCode and equals in a certain way so that 
 Lucene recognizes we are talking about different requests.
 Attached test case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results

2012-10-09 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13472416#comment-13472416
 ] 

Gilad Barkai edited comment on LUCENE-4461 at 10/9/12 2:16 PM:
---

The same category can be set as a filter and as a request without them 
colliding - a filter is not correlated or dependent on a facet request. 
Facet filters are applied at the query level, which affects the result set, while 
the FacetRequest defines which categories to retrieve out of the result set.
I'm probably missing something here :)



  was (Author: gilad):
The same category can be set as a filter and as a request without them 
colliding - a filter is correlated or dependent on a facet request. 
Facet filters are applied at the query level, which affects the result set, while 
the FacetRequest defines which categories to retrieve out of the result set.
I'm probably missing something here :)


  
 Multiple FacetRequest with the same path creates inconsistent results
 -

 Key: LUCENE-4461
 URL: https://issues.apache.org/jira/browse/LUCENE-4461
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6
Reporter: Rodrigo Vega
  Labels: facet, faceted-search
 Attachments: LuceneFacetTest.java


 Multiple FacetRequests are getting merged into one, creating wrong results in 
 this case:
 FacetSearchParams facetSearchParams = new FacetSearchParams();
   facetSearchParams.addFacetRequest(new CountFacetRequest(new 
 CategoryPath("author"), 10));
   facetSearchParams.addFacetRequest(new CountFacetRequest(new 
 CategoryPath("author"), 10));
 Problem can be fixed by defining hashCode and equals in a certain way so that 
 Lucene recognizes we are talking about different requests.
 Attached test case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4461) Multiple FacetRequest with the same path creates inconsistent results

2012-10-09 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13472437#comment-13472437
 ] 

Gilad Barkai commented on LUCENE-4461:
--

Well, it would solve your code issues, no? :) But the current code is indeed 
broken by the issue you raised - and that should be fixed.
I re-examined the code, and I think the different hashCode you presented will 
work - though please note it will consume some extra CPU, as the same request 
will be handled twice (that is, the heap used to figure out the top-k of the 
request) to create separate FacetResults for each request.



 Multiple FacetRequest with the same path creates inconsistent results
 -

 Key: LUCENE-4461
 URL: https://issues.apache.org/jira/browse/LUCENE-4461
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6
Reporter: Rodrigo Vega
  Labels: facet, faceted-search
 Attachments: LuceneFacetTest.java


 Multiple FacetRequests are getting merged into one, creating wrong results in 
 this case:
 FacetSearchParams facetSearchParams = new FacetSearchParams();
   facetSearchParams.addFacetRequest(new CountFacetRequest(new 
 CategoryPath("author"), 10));
   facetSearchParams.addFacetRequest(new CountFacetRequest(new 
 CategoryPath("author"), 10));
 Problem can be fixed by defining hashCode and equals in a certain way so that 
 Lucene recognizes we are talking about different requests.
 Attached test case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect

2012-09-24 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4411:
-

Attachment: LUCENE-4411.patch

Attached a proposed fix + test.

Delegating is impossible as {{FacetRequest}}'s getters are {{final}}. The only 
way to 'delegate' the information is using the setters in the wrapping class 
(e.g. setDepth(original.getDepth())).
This solution does not seem like the right one, but other approaches involve 
changing much more code and reducing the amount of protection the public API 
offers (e.g. the user would find it easier to break something).

The patch also introduces the (missing) delegation of the {{SortBy}}, 
{{SortOrder}}, {{numResultsToLable}} and {{ResultMode}} methods.

Somewhat out of the scope of this issue - I tried to figure out why wrapping and 
keeping the ??original?? request is important:
The ??count?? (the number of categories to return) is {{final}}, set at 
construction. While sampling is in effect, in order to improve the chances of 
'hitting' the true top-k results, the notion of oversampling is introduced, 
which asks for more than just K results (e.g. 3 * K) - so another request must 
be made. The 'original' request is saved so the end result will hold the 
original request, and not the over-sampled one (every {{FacetResult}} has its 
originating {{FacetRequest}}).
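For illustration, a sketch of the wrapping and delegation described above. Only 
{{setDepth(original.getDepth())}} is quoted from this comment; the remaining 
getter/setter names (and {{getNumResults}}) are assumptions about the API:

{code}
// Build the over-sampled request, then copy over everything the wrapper
// would otherwise lose (depth, sort-by, sort order, labeling, result mode).
static FacetRequest overSample(FacetRequest original, double overSampleFactor) {
  int overSampledCount = (int) (original.getNumResults() * overSampleFactor);
  FacetRequest wrapped =
      new CountFacetRequest(original.getCategoryPath(), overSampledCount);
  wrapped.setDepth(original.getDepth());
  wrapped.setSortBy(original.getSortBy());
  wrapped.setSortOrder(original.getSortOrder());
  wrapped.setNumLabel(original.getNumLabel());
  wrapped.setResultMode(original.getResultMode());
  return wrapped;
}
{code}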


 Depth requested in a facetRequest is reset when Sampling is in effect
 -

 Key: LUCENE-4411
 URL: https://issues.apache.org/jira/browse/LUCENE-4411
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6.1, 5.0, 4.0
Reporter: Gilad Barkai
 Attachments: LUCENE-4411.patch, OversampleWithDepthTest.java, 
 OversampleWithDepthTest.java


 FacetRequest can be set a Depth parameter, which controls the depth of the 
 result tree to be returned.
 When Sampling is enabled (and actually used) the Depth parameter gets reset 
 to its default (1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect

2012-09-20 Thread Gilad Barkai (JIRA)
Gilad Barkai created LUCENE-4411:


 Summary: Depth requested in a facetRequest is reset when Sampling 
is in effect
 Key: LUCENE-4411
 URL: https://issues.apache.org/jira/browse/LUCENE-4411
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6.1, 5.0, 4.0
Reporter: Gilad Barkai
 Attachments: OversampleWithDepthTest.java

FacetRequest can be set a Depth parameter, which controls the depth of the 
result tree to be returned.
When Sampling is enabled (and actually used) the Depth parameter gets reset to 
its default (1).


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect

2012-09-20 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4411:
-

Attachment: OversampleWithDepthTest.java

A test revealing the bug for trunk.

 Depth requested in a facetRequest is reset when Sampling is in effect
 -

 Key: LUCENE-4411
 URL: https://issues.apache.org/jira/browse/LUCENE-4411
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6.1, 5.0, 4.0
Reporter: Gilad Barkai
 Attachments: OversampleWithDepthTest.java


 FacetRequest can be set a Depth parameter, which controls the depth of the 
 result tree to be returned.
 When Sampling is enabled (and actually used) the Depth parameter gets reset 
 to its default (1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect

2012-09-20 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4411:
-

Attachment: OversampleWithDepthTest.java

A test revealing the bug for 3.6.

 Depth requested in a facetRequest is reset when Sampling is in effect
 -

 Key: LUCENE-4411
 URL: https://issues.apache.org/jira/browse/LUCENE-4411
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6.1, 5.0, 4.0
Reporter: Gilad Barkai
 Attachments: OversampleWithDepthTest.java, 
 OversampleWithDepthTest.java


 FacetRequest can be set a Depth parameter, which controls the depth of the 
 result tree to be returned.
 When Sampling is enabled (and actually used) the Depth parameter gets reset 
 to its default (1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4411) Depth requested in a facetRequest is reset when Sampling is in effect

2012-09-20 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13459543#comment-13459543
 ] 

Gilad Barkai commented on LUCENE-4411:
--

When sampling is in effect, the original FacetRequest is replaced with a 
wrapper named OverSampledFacetRequest, which takes into account different 
sampling-related parameters.
This wrapping class modifies only a small set of parameters in the request, 
and should otherwise delegate everything to the original one - but it does not.
Some of the information that is lost from the original request: SortOrder, 
SortBy, number of results to label, ResultMode and Depth.

A patch will be available shortly.
 

 Depth requested in a facetRequest is reset when Sampling is in effect
 -

 Key: LUCENE-4411
 URL: https://issues.apache.org/jira/browse/LUCENE-4411
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6.1, 5.0, 4.0
Reporter: Gilad Barkai
 Attachments: OversampleWithDepthTest.java, 
 OversampleWithDepthTest.java


 FacetRequest can be set a Depth parameter, which controls the depth of the 
 result tree to be returned.
 When Sampling is enabled (and actually used) the Depth parameter gets reset 
 to its default (1).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4302) Javadoc for facet User Guide does not display because of SAXParseException (Eclipse, Maven)

2012-08-13 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13433404#comment-13433404
 ] 

Gilad Barkai commented on LUCENE-4302:
--

I'm with Shai on this - according to w3schools this is actually not a problem.
Please see: http://www.w3schools.com/tags/tag_br.asp

In HTML, the br element has no closing tag. 


 Javadoc for facet User Guide does not display because of SAXParseException 
 (Eclipse, Maven)
 ---

 Key: LUCENE-4302
 URL: https://issues.apache.org/jira/browse/LUCENE-4302
 Project: Lucene - Core
  Issue Type: Bug
  Components: general/javadocs
Affects Versions: 4.0-ALPHA
 Environment: Windows 7-64bit/Eclipse Juno (4.2)/Maven m2e 
 plugin/firefox latest
Reporter: Karl Nicholas
Priority: Minor
  Labels: documentation
   Original Estimate: 1h
  Remaining Estimate: 1h

 I have opened javadoc for Facet API while using Eclipse, which downloaded the 
 javadocs using Maven m2e plugin. When I click on facet User Guide on the 
 overview page I get the following exception in FireFox:
 http://127.0.0.1:49231/help/nftopic/jar:file:/C:/Users/karl/.m2/repository/org/apache/lucene/lucene-facet/4.0.0-ALPHA/
 lucene-facet-4.0.0-ALPHA-javadoc.jar!/org/apache/lucene/facet/doc-files/userguide.html
 An error occured while processing the requested document:
 org.xml.sax.SAXParseException; lineNumber: 121; columnNumber: 16; The element 
 type "br" must be terminated by the matching end-tag "</br>".
   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown 
 Source)
   at 
 com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown 
 Source)
 The link, or requested document is:
 http://127.0.0.1:49231/help/nftopic/jar:file:/C:/Users/karl/.m2/repository/org/apache/lucene/lucene-facet/4.0.0-ALPHA/
 lucene-facet-4.0.0-ALPHA-javadoc.jar!/org/apache/lucene/facet/doc-files/userguide.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-22 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4234:
-

Attachment: (was: LUCENE-4234-fix.patch)

 Exception when FacetsCollector is used with ScoreFacetRequest
 -

 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Gilad Barkai
Assignee: Shai Erera
 Fix For: 4.0, 5.0, 3.6.2

 Attachments: LUCENE-4234.patch


 Aggregating facets with Lucene's Score using FacetsCollector results in an 
 Exception (assertion when enabled).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-22 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4234:
-

Attachment: (was: LUCENE-4234-test.patch)

 Exception when FacetsCollector is used with ScoreFacetRequest
 -

 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Gilad Barkai
Assignee: Shai Erera
 Fix For: 4.0, 5.0, 3.6.2

 Attachments: LUCENE-4234.patch


 Aggregating facets with Lucene's Score using FacetsCollector results in an 
 Exception (assertion when enabled).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-22 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4234:
-

Attachment: LUCENE-4234.patch

One patch (to rule them all) instead of the previous two. No changes were made thus far.

 Exception when FacetsCollector is used with ScoreFacetRequest
 -

 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Gilad Barkai
Assignee: Shai Erera
 Fix For: 4.0, 5.0, 3.6.2

 Attachments: LUCENE-4234.patch


 Aggregating facets with Lucene's Score using FacetsCollector results in an 
 Exception (assertion when enabled).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-22 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13420125#comment-13420125
 ] 

Gilad Barkai commented on LUCENE-4234:
--

Hi Shai, thank you for reviewing.

1. Merged.
2. Using FBS is possible, though it's a larger change than the fix - perhaps 
handle this in a separate issue?
3. IMHO that's a waste for a large index. While allocating one large array up 
front is faster, it takes a lot of memory - probably for no reason. With 
reallocation each query will take a bit longer, but concurrent queries will not 
hurt the GC that much. If the index is large, concurrent queries which each 
preallocate such an array might hit the GC hard, perhaps even OOM? I must admit 
I never measured the reallocation penalty.
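As a rough sketch of the trade-off in point 3 (hypothetical, not the patch 
itself) - growing the scores array on demand instead of preallocating one float 
per document in the index:

{code}
import java.util.Arrays;

class GrowingScores {
  private float[] scores = new float[64]; // small initial capacity
  private int upto = 0;

  void collect(float score) {
    if (upto == scores.length) {
      // doubling keeps the amortized reallocation cost per document constant
      scores = Arrays.copyOf(scores, scores.length * 2);
    }
    scores[upto++] = score;
  }
}
{code}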

 Exception when FacetsCollector is used with ScoreFacetRequest
 -

 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Gilad Barkai
Assignee: Shai Erera
 Fix For: 4.0, 5.0, 3.6.2

 Attachments: LUCENE-4234.patch


 Aggregating facets with Lucene's Score using FacetsCollector results in an 
 Exception (assertion when enabled).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-22 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13420144#comment-13420144
 ] 

Gilad Barkai commented on LUCENE-4234:
--

Thank you for looking into this and making the appropriate changes. 
Patch looks good.

 Exception when FacetsCollector is used with ScoreFacetRequest
 -

 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Gilad Barkai
Assignee: Shai Erera
 Fix For: 4.0, 5.0, 3.6.2

 Attachments: LUCENE-4234.patch, LUCENE-4234.patch


 Aggregating facets with Lucene's Score using FacetsCollector results in an 
 Exception (assertion when enabled).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-18 Thread Gilad Barkai (JIRA)
Gilad Barkai created LUCENE-4234:


 Summary: Exception when FacetsCollector is used with 
ScoreFacetRequest
 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 4.0-ALPHA, 3.6
Reporter: Gilad Barkai
 Fix For: 3.6.1, 4.0


Aggregating facets with Lucene's Score using FacetsCollector results in an 
Exception (assertion when enabled).



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-18 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4234:
-

Attachment: LUCENE-4234-test.patch

A test revealing the bug.

 Exception when FacetsCollector is used with ScoreFacetRequest
 -

 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Gilad Barkai
 Fix For: 3.6.1, 4.0

 Attachments: LUCENE-4234-test.patch


 Aggregating facets with Lucene's Score using FacetsCollector results in an 
 Exception (assertion when enabled).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4234) Exception when FacetsCollector is used with ScoreFacetRequest

2012-07-18 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4234:
-

Attachment: LUCENE-4234-fix.patch

Proposed fix: the code was optimized for a capacity of 1000 documents, but the 
BitSet was also initialized with the same value of 1000, causing .fastSet() to 
fail for larger doc ids.

The fix initializes the scores array to a size of 64, and the bitset to 
.maxDoc(), as in the non-score-keeping version.
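To illustrate the bug and the fix (sizes taken from this comment; {{reader}} 
stands for whatever IndexReader is in use):

{code}
// buggy: the bitset was sized like the initial scores capacity, so
// fastSet() - which performs no bounds growth - fails
// (assertion when enabled) once a doc id exceeds 999
OpenBitSet bits = new OpenBitSet(1000);
bits.fastSet(1500);

// fixed: the bitset covers the full doc id range up front, and only the
// scores array starts small (64) and grows as documents are collected
OpenBitSet fixedBits = new OpenBitSet(reader.maxDoc());
float[] scores = new float[64];
{code}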

 Exception when FacetsCollector is used with ScoreFacetRequest
 -

 Key: LUCENE-4234
 URL: https://issues.apache.org/jira/browse/LUCENE-4234
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/facet
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Gilad Barkai
 Fix For: 3.6.1, 4.0

 Attachments: LUCENE-4234-fix.patch, LUCENE-4234-test.patch


 Aggregating facets with Lucene's Score using FacetsCollector results in an 
 Exception (assertion when enabled).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files

2012-07-05 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406877#comment-13406877
 ] 

Gilad Barkai commented on LUCENE-4190:
--

Perhaps out of context, but here goes..
Users sometimes do stupid things, me included, such as putting the index in a 
non-dedicated directory. But should they pay the penalty just so the code does 
not get overly complicated?

Codecs create their own files, and no one seems able to control which files they 
create (other than in an assert?). Then, is it possible for each codec to handle 
the removal of the files it created? 

That would make codecs work the same way the index handles the 'core' index 
files - each codec would be able to erase its own.
Another closely related option - let IW consult the codecs about 'non-core' 
files and see which ones should/could be removed (a rough sketch below).
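A hypothetical sketch of the 'each codec erases its own files' idea - none of 
this is an existing Lucene API:

{code}
// Each codec reports whether a file name is one it created; IndexWriter
// would then delete only files some codec claims, leaving foreign files alone.
interface OwnsFiles {
  boolean ownsFile(String fileName);
}

static boolean safeToDelete(String fileName, Iterable<OwnsFiles> codecs) {
  for (OwnsFiles codec : codecs) {
    if (codec.ownsFile(fileName)) {
      return true; // created by a codec, so the index owns it
    }
  }
  return false; // a foreign (user) file - do not erase
}
{code}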

I only suggest this because I fear for users' files which might get erased. 

Disclosure:
It'll be ages before I understand Lucene 4 half as much as I do Lucene 3.6 (not 
that that's much), so forgive me if I stepped on anyone's toes, or just 
described how to implement a time machine :)

 IndexWriter deletes non-Lucene files
 

 Key: LUCENE-4190
 URL: https://issues.apache.org/jira/browse/LUCENE-4190
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Robert Muir
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4190.patch, LUCENE-4190.patch


 Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog 
 post: 
 http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
 IndexWriter will now (as of 4.0) delete all foreign files from the index 
 directory.  We made this change because Codecs are free to write to any files 
 now, so the space of filenames is hard to bound.
 But if the user accidentally uses the wrong directory (eg c:/) then we will 
 in fact delete important stuff.
 I think we can at least use some simple criteria (must start with _, maybe 
 must fit certain pattern eg _base36(_X).Y), so we are much less likely to 
 delete a non-Lucene file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4190) IndexWriter deletes non-Lucene files

2012-07-05 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407086#comment-13407086
 ] 

Gilad Barkai commented on LUCENE-4190:
--

{quote}
So if we insist on a perfect solution: then fine, the perfect solution I accept 
is for lucene to totally own
the directory, don't put files in there! Then the behavior is clear, no bugs, 
we delete everything.
{quote}

But then we're left with the original problem - should a poor user (say, me) 
accidentally put an index in an already filled directory (say /tmp) - the price 
to pay is great.
Too great, IMHO.

 IndexWriter deletes non-Lucene files
 

 Key: LUCENE-4190
 URL: https://issues.apache.org/jira/browse/LUCENE-4190
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Robert Muir
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4190.patch, LUCENE-4190.patch


 Carl Austin raised a good issue in a comment on my Lucene 4.0.0 alpha blog 
 post: 
 http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
 IndexWriter will now (as of 4.0) delete all foreign files from the index 
 directory.  We made this change because Codecs are free to write to any files 
 now, so the space of filenames is hard to bound.
 But if the user accidentally uses the wrong directory (eg c:/) then we will 
 in fact delete important stuff.
 I think we can at least use some simple criteria (must start with _, maybe 
 must fit certain pattern eg _base36(_X).Y), so we are much less likely to 
 delete a non-Lucene file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3409) NRT reader/writer over RAMDirectory memory leak

2011-08-31 Thread Gilad Barkai (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13094388#comment-13094388
 ] 

Gilad Barkai commented on LUCENE-3409:
--

This issue is relevant for trunk as well.
Please update the Affected versions accordingly.

 NRT reader/writer over RAMDirectory memory leak
 ---

 Key: LUCENE-3409
 URL: https://issues.apache.org/jira/browse/LUCENE-3409
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.0.2, 3.3
Reporter: tal steier

 with NRT reader/writer, emptying an index using:
 writer.deleteAll()
 writer.commit()
 doesn't release all allocated memory.
 for example the following code will generate a memory leak:
 /**
  * Reveals a memory leak in NRT reader/writer<br>
  * 
  * The following main() does 10K cycles of:
  * <ul>
  * <li>Add 10K empty documents to index writer</li>
  * <li>commit()</li>
  * <li>open NRT reader over the writer, and immediately close it</li>
  * <li>delete all documents from the writer</li>
  * <li>commit changes to the writer</li>
  * </ul>
  * 
  * Running with -Xmx256M results in an OOME after ~2600 cycles
  */
 public static void main(String[] args) throws Exception {
   RAMDirectory d = new RAMDirectory();
   IndexWriter w = new IndexWriter(d,
       new IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer()));
   Document doc = new Document();

   // 10K cycles of add/commit/open-NRT/deleteAll/commit
   for (int i = 0; i < 10000; i++) {
     // add 10K empty documents per cycle
     for (int j = 0; j < 10000; ++j) {
       w.addDocument(doc);
     }
     w.commit();
     IndexReader.open(w, true).close(); // NRT reader, opened and closed
     w.deleteAll();
     w.commit();
   }

   w.close();
   d.close();
 }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-3390) Incorrect sort by Numeric (double) values for documents missing the sorting field

2011-08-22 Thread Gilad Barkai (JIRA)
Incorrect sort by Numeric (double) values for documents missing the sorting 
field
-

 Key: LUCENE-3390
 URL: https://issues.apache.org/jira/browse/LUCENE-3390
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.3
Reporter: Gilad Barkai
Priority: Minor


While sorting results over a numeric double field, documents which do not 
contain a value for the sorting field seem to get a 0 (ZERO) value in the sort. 
This behavior is unexpected, as zero is then comparable to the rest of the values. 
A better solution would be either allowing the user to define such a no-value 
default, or always returning those documents last.

Example scenario:
Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
value.
Searching with MatchAllDocsQuery, with sort over that field in descending order 
yields the docid results of 0, 2, 1.

Example code:
public static void main(String[] args) throws Exception {
    RAMDirectory d = new RAMDirectory();
    IndexWriter w = new IndexWriter(d, new 
IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer()));

    // 1st doc, value 3.5d
    Document doc = new Document();
    doc.add(new NumericField("f", Store.YES, true).setDoubleValue(3.5d));
    w.addDocument(doc);

    // 2nd doc, value of -10d
    doc = new Document();
    doc.add(new NumericField("f", Store.YES, true).setDoubleValue(-10d));
    w.addDocument(doc);

    // 3rd doc, no value at all
    w.addDocument(new Document());
    w.close();

    IndexSearcher s = new IndexSearcher(d);
    Sort sort = new Sort(new SortField("f", SortField.DOUBLE, true));
    TopDocs td = s.search(new MatchAllDocsQuery(), 10, sort);
    for (ScoreDoc sd : td.scoreDocs) {
        System.out.println(sd.doc + ": " + s.doc(sd.doc).get("f"));
    }
    s.close();
    d.close();
}
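A hypothetical sketch of the 'user-defined non-value default' suggestion - 
{{setMissingValue}} did not exist in 3.3 (a similar API appeared only in later 
Lucene versions), shown here purely to illustrate the proposal:

{code}
SortField byF = new SortField("f", SortField.DOUBLE, true); // descending
byF.setMissingValue(Double.NEGATIVE_INFINITY); // docs missing "f" sort last
Sort sort = new Sort(byF);
{code}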
 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3390) Incorrect sort by Numeric values for documents missing the sorting field

2011-08-22 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-3390:
-

 Labels: double float int long numeric sort  (was: double numeric sort)
Description: 
While sorting results over a numeric field, documents which do not contain a 
value for the sorting field seem to get a 0 (ZERO) value in the sort. (Tested 
against Double, Float, Int and Long numeric fields, in ascending and descending 
order.)
This behavior is unexpected, as zero is then comparable to the rest of the values. 
A better solution would be either allowing the user to define such a no-value 
default, or always returning those documents last.

Example scenario:
Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
value.
Searching with MatchAllDocsQuery, with sort over that field in descending order 
yields the docid results of 0, 2, 1.

Asking for the top 2 documents brings the document without any value as the 2nd 
result - which seems like a bug?

  was:
While sorting results over a numeric double field, documents which do not 
contain a value for the sorting field seem to get 0 (ZERO) value in the sort. 
This behavior is unexpected, as zero is comparable to the rest of the values. 
A better solution would either be allowing the user to define such a 
non-value default, or always bring those document results as the last ones.

Example scenario:
Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
value.
Searching with MatchAllDocsQuery, with sort over that field in descending order 
yields the docid results of 0, 2, 1.

While the document with the missing value does match the query, I would expect 
it to come last, as it is not comparable to the other documents. For example, 
asking for the top 2 documents brings the document without any value, which 
seems like a bug?

Summary: Incorrect sort by Numeric values for documents missing the 
sorting field  (was: Incorrect sort by Numeric (double) values for documents 
missing the sorting field)

 Incorrect sort by Numeric values for documents missing the sorting field
 

 Key: LUCENE-3390
 URL: https://issues.apache.org/jira/browse/LUCENE-3390
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.3
Reporter: Gilad Barkai
Priority: Minor
  Labels: double, float, int, long, numeric, sort
 Attachments: SortByDouble.java


 While sorting results over a numeric field, documents which do not contain a 
 value for the sorting field seem to get a 0 (ZERO) value in the sort. (Tested 
 against Double, Float, Int and Long numeric fields, in ascending and descending 
 order.)
 This behavior is unexpected, as zero is then comparable to the rest of the 
 values. A better solution would be either allowing the user to define such a 
 no-value default, or always returning those documents last.
 Example scenario:
 Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
 value.
 Searching with MatchAllDocsQuery, with sort over that field in descending 
 order yields the docid results of 0, 2, 1.
 Asking for the top 2 documents brings the document without any value as the 
 2nd result - which seems like a bug?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3390) Incorrect sort by Numeric (double) values for documents missing the sorting field

2011-08-22 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-3390:
-

Description: 
While sorting results over a numeric double field, documents which do not 
contain a value for the sorting field seem to get a 0 (ZERO) value in the sort. 
This behavior is unexpected, as zero is then comparable to the rest of the values. 
A better solution would be either allowing the user to define such a no-value 
default, or always returning those documents last.

Example scenario:
Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
value.
Searching with MatchAllDocsQuery, with sort over that field in descending order 
yields the docid results of 0, 2, 1.

While the document with the missing value does match the query, I would expect 
it to come last, as it is not comparable to the other documents. For example, 
asking for the top 2 documents brings the document without any value, which 
seems like a bug?

  was:
While sorting results over a numeric double field, documents which do not 
contain a value for the sorting field seem to get a 0 (ZERO) value in the sort. 
This behavior is unexpected, as zero is then comparable to the rest of the values. 
A better solution would be either allowing the user to define such a no-value 
default, or always returning those documents last.

Example scenario:
Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
value.
Searching with MatchAllDocsQuery, with sort over that field in descending order 
yields the docid results of 0, 2, 1.

Example code:
public static void main(String[] args) throws Exception {
    RAMDirectory d = new RAMDirectory();
    IndexWriter w = new IndexWriter(d, new 
IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer()));

    // 1st doc, value 3.5d
    Document doc = new Document();
    doc.add(new NumericField("f", Store.YES, true).setDoubleValue(3.5d));
    w.addDocument(doc);

    // 2nd doc, value of -10d
    doc = new Document();
    doc.add(new NumericField("f", Store.YES, true).setDoubleValue(-10d));
    w.addDocument(doc);

    // 3rd doc, no value at all
    w.addDocument(new Document());
    w.close();

    IndexSearcher s = new IndexSearcher(d);
    Sort sort = new Sort(new SortField("f", SortField.DOUBLE, true));
    TopDocs td = s.search(new MatchAllDocsQuery(), 10, sort);
    for (ScoreDoc sd : td.scoreDocs) {
        System.out.println(sd.doc + ": " + s.doc(sd.doc).get("f"));
    }
    s.close();
    d.close();
}
 


 Incorrect sort by Numeric (double) values for documents missing the sorting 
 field
 -

 Key: LUCENE-3390
 URL: https://issues.apache.org/jira/browse/LUCENE-3390
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.3
Reporter: Gilad Barkai
Priority: Minor
  Labels: double, numeric, sort
 Attachments: SortByDouble.java


 While sorting results over a numeric double field, documents which do not 
 contain a value for the sorting field seem to get a 0 (ZERO) value in the sort. 
 This behavior is unexpected, as zero is then comparable to the rest of the 
 values. A better solution would be either allowing the user to define such a 
 no-value default, or always returning those documents last.
 Example scenario:
 Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
 value.
 Searching with MatchAllDocsQuery, with sort over that field in descending 
 order yields the docid results of 0, 2, 1.
 While the document with the missing value does match the query, I would 
 expect it to come last, as it is not comparable to the other documents. For 
 example, asking for the top 2 documents brings the document without any value, 
 which seems like a bug?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3390) Incorrect sort by Numeric (double) values for documents missing the sorting field

2011-08-22 Thread Gilad Barkai (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-3390:
-

Attachment: SortByDouble.java

example code

 Incorrect sort by Numeric (double) values for documents missing the sorting 
 field
 -

 Key: LUCENE-3390
 URL: https://issues.apache.org/jira/browse/LUCENE-3390
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.3
Reporter: Gilad Barkai
Priority: Minor
  Labels: double, numeric, sort
 Attachments: SortByDouble.java


 While sorting results over a numeric double field, documents which do not 
 contain a value for the sorting field seem to get a 0 (ZERO) value in the sort. 
 This behavior is unexpected, as zero is then comparable to the rest of the 
 values. A better solution would be either allowing the user to define such a 
 no-value default, or always returning those documents last.
 Example scenario:
 Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any 
 value.
 Searching with MatchAllDocsQuery, with sort over that field in descending 
 order yields the docid results of 0, 2, 1.
 Example code:
 public static void main(String[] args) throws Exception {
   RAMDirectory d = new RAMDirectory();
   IndexWriter w = new IndexWriter(d, new 
 IndexWriterConfig(Version.LUCENE_33, new KeywordAnalyzer()));
   
   // 1st doc, value 3.5d
   Document doc = new Document();
   doc.add(new NumericField("f", Store.YES, true).setDoubleValue(3.5d));
   w.addDocument(doc);
   
   // 2nd doc, value of -10d
   doc = new Document();
   doc.add(new NumericField("f", Store.YES, true).setDoubleValue(-10d));
   w.addDocument(doc);
   
   // 3rd doc, no value at all
   w.addDocument(new Document());
   w.close();
   IndexSearcher s = new IndexSearcher(d);
   Sort sort = new Sort(new SortField("f", SortField.DOUBLE, true));
   TopDocs td = s.search(new MatchAllDocsQuery(), 10, sort);
   for (ScoreDoc sd : td.scoreDocs) {
       System.out.println(sd.doc + ": " + s.doc(sd.doc).get("f"));
   }
   s.close();
   d.close();
 }
  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org