from:"Martijn van Groningen \(JIRA\)"

[jira] [Commented] (LUCENE-6572) Highlighter depends on analyzers-common

2018-10-17 Thread Martijn van Groningen (JIRA)



[ 
https://issues.apache.org/jira/browse/LUCENE-6572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653407#comment-16653407
 ] 

Martijn van Groningen commented on LUCENE-6572:
---

> do you have background on why we have support for these queries in 
>highlighting?

I don't recall why, but as in the comment you mention, it is weird to support 
this. A different highlight query should be used when child or parent 
documents need to be highlighted. Elasticsearch ignores parent/child queries 
when highlighting and expects users who want highlighting for child docs 
to specify a highlight query.

> Highlighter depends on analyzers-common
> ---
>
> Key: LUCENE-6572
> URL: https://issues.apache.org/jira/browse/LUCENE-6572
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/highlighter
>Reporter: Robert Muir
>Assignee: Simon Willnauer
>Priority: Blocker
> Attachments: LUCENE-6572.patch, LUCENE-6572.patch
>
>
> This is a huge WTF, just for "LimitTokenOffsetFilter" which is only useful 
> for highlighting.
> Adding all these intermodule dependencies makes things too hard to use.
> This is a 5.3 release blocker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-8152) Simplify conditionals in JoinUtil

2018-02-02 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350317#comment-16350317
 ] 

Martijn van Groningen commented on LUCENE-8152:
---

+1 That is much cleaner

> Simplify conditionals in JoinUtil 
> --
>
> Key: LUCENE-8152
> URL: https://issues.apache.org/jira/browse/LUCENE-8152
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Horatiu Lazu
>Priority: Trivial
> Attachments: LUCENE-8152.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following could be simplified, on line 249:
> {code:java}
> int dvDocID = numericDocValues.docID();
> if (dvDocID < doc) {
>   dvDocID = numericDocValues.advance(doc);
> }
> long value;
> if (dvDocID == doc) {
>   value = numericDocValues.longValue();
> } else {
>   value = 0;
> }
> {code}
> To:
> {code:java}
> long value = 0;
> if (numericDocValues.advanceExact(doc)) {
>   value = numericDocValues.longValue();
> }
> {code}
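
A minimal, self-contained sketch of why the two forms quoted above read the same value. `ToyNumericDocValues` is a hypothetical stand-in written for this illustration, not Lucene's `NumericDocValues`; it only mimics the `docID()`, `advance(int)`, `advanceExact(int)` and `longValue()` calls, and it assumes targets are requested in increasing order.

```java
/** Hypothetical stand-in for a forward-only numeric doc values iterator (not the Lucene class). */
final class ToyNumericDocValues {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;
  private final int[] docs;    // sorted doc ids that have a value
  private final long[] values; // values[i] belongs to docs[i]
  private int idx = -1;        // position in docs; -1 means not started

  ToyNumericDocValues(int[] docs, long[] values) {
    this.docs = docs;
    this.values = values;
  }

  int docID() {
    if (idx < 0) return -1;
    return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
  }

  /** Moves to the first doc >= target and returns it. */
  int advance(int target) {
    if (idx < 0) idx = 0;
    while (idx < docs.length && docs[idx] < target) idx++;
    return docID();
  }

  /** Positions on target and reports whether it has a value. */
  boolean advanceExact(int target) {
    return advance(target) == target;
  }

  long longValue() {
    return values[idx];
  }

  /** The longer form quoted in the issue. */
  static long readVerbose(ToyNumericDocValues dv, int doc) {
    int dvDocID = dv.docID();
    if (dvDocID < doc) {
      dvDocID = dv.advance(doc);
    }
    return dvDocID == doc ? dv.longValue() : 0;
  }

  /** The shorter form proposed in the issue. */
  static long readCompact(ToyNumericDocValues dv, int doc) {
    return dv.advanceExact(doc) ? dv.longValue() : 0;
  }
}
```

Running both readers over the same increasing sequence of doc ids should yield identical values, with 0 for documents that have no value.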




[jira] [Resolved] (LUCENE-8120) Fix LatLonBoundingBox's toString() method

2018-01-11 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-8120.
---
   Resolution: Fixed
Fix Version/s: 7.3
   master (8.0)

> Fix LatLonBoundingBox's toString() method
> -
>
> Key: LUCENE-8120
> URL: https://issues.apache.org/jira/browse/LUCENE-8120
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
>Priority: Trivial
> Fix For: master (8.0), 7.3
>
> Attachments: LUCENE-8120.patch, LUCENE-8120.patch
>
>





[jira] [Updated] (LUCENE-8120) Fix LatLonBoundingBox's toString() method

2018-01-09 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-8120:
--
Attachment: LUCENE-8120.patch

bq. Let's maybe just give bounds back in the same order as they were passed in 
the constructor, separated by commas?

Yes, that is clearer. I've updated the patch to do this now.

> Fix LatLonBoundingBox's toString() method
> -
>
> Key: LUCENE-8120
> URL: https://issues.apache.org/jira/browse/LUCENE-8120
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE-8120.patch, LUCENE-8120.patch
>
>





[jira] [Updated] (LUCENE-8120) Fix LatLonBoundingBox's toString() method

2018-01-05 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-8120:
--
Attachment: LUCENE-8120.patch

> Fix LatLonBoundingBox's toString() method
> -
>
> Key: LUCENE-8120
> URL: https://issues.apache.org/jira/browse/LUCENE-8120
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE-8120.patch
>
>





[jira] [Created] (LUCENE-8120) Fix LatLonBoundingBox's toString() method

2018-01-05 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-8120:
-

 Summary: Fix LatLonBoundingBox's toString() method
 Key: LUCENE-8120
 URL: https://issues.apache.org/jira/browse/LUCENE-8120
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Martijn van Groningen
Priority: Trivial







[jira] [Commented] (LUCENE-8104) Grouping module should no longer depend on Queries module (ValueSource)

2017-12-21 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299740#comment-16299740
 ] 

Martijn van Groningen commented on LUCENE-8104:
---

I think that, for now, the ValueSourceGroupSelector class should be moved to 
the solr-core module. The grouping module's dependency on the queries module 
can then be removed, which is a big win on its own.

> Grouping module should no longer depend on Queries module (ValueSource)
> ---
>
> Key: LUCENE-8104
> URL: https://issues.apache.org/jira/browse/LUCENE-8104
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/grouping
>Reporter: David Smiley
> Attachments: LUCENE-8104.patch
>
>
> The Grouping module depends on the Queries module in GroupingSearch / 
> ValueSourceGroupSelector to use the ValueSource framework.  It should instead 
> use the newer DoubleValueSource or LongValueSource framework in Core.  As I 
> write this, this appears to be the last part of Lucene to refer to the 
> ValueSource framework, and I think we should then remove it -- for another 
> issue of course.




[jira] [Commented] (LUCENE-8055) MemoryIndex.MemoryDocValuesIterator returns 2 document instead of 1

2017-11-21 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260581#comment-16260581
 ] 

Martijn van Groningen commented on LUCENE-8055:
---

+1 thanks [~simonw]!

> MemoryIndex.MemoryDocValuesIterator returns 2 document instead of 1
> ---
>
> Key: LUCENE-8055
> URL: https://issues.apache.org/jira/browse/LUCENE-8055
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/other
>Reporter: Simon Willnauer
>Assignee: Simon Willnauer
> Fix For: master (8.0), 7.2, 7.1.1
>
> Attachments: LUCENE-8055.patch
>
>
> If there is a DV field in the MemoryIndex, the 
> `MemoryIndex.MemoryDocValuesIterator` will return 2 documents instead of 1. 
> Simple off-by-one error and no tests. I have a patch ready for it.




[jira] [Closed] (LUCENE-7928) Change LatLonPointDistanceQuery and LatLonPointInPolygonQuery visibility to public

2017-08-11 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen closed LUCENE-7928.
-
Resolution: Fixed

OK, I agree that this is an expert use case and that the API shouldn't have to 
change for that.

> Change LatLonPointDistanceQuery and LatLonPointInPolygonQuery visibility to 
> public
> --
>
> Key: LUCENE-7928
> URL: https://issues.apache.org/jira/browse/LUCENE-7928
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Fix For: master (8.0), 7.1
>
> Attachments: LUCENE_7928.patch
>
>
> Changing the visibility of these classes to public can be useful for 
> accessing the getters (which are already public) to allow custom post 
> processing of the query instances.




[jira] [Commented] (LUCENE-7928) Change LatLonPointDistanceQuery and LatLonPointInPolygonQuery visibility to public

2017-08-11 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123309#comment-16123309
 ] 

Martijn van Groningen commented on LUCENE-7928:
---

I agree that these queries should only be constructed via their factory 
methods; that is why the constructors are package-private. I would like these 
two classes to be public so that their getters can be accessed, which is useful 
for query processing (this is what the percolator and luwak both do).

> Change LatLonPointDistanceQuery and LatLonPointInPolygonQuery visibility to 
> public
> --
>
> Key: LUCENE-7928
> URL: https://issues.apache.org/jira/browse/LUCENE-7928
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Fix For: master (8.0), 7.1
>
> Attachments: LUCENE_7928.patch
>
>
> Changing the visibility of these classes to public can be useful for 
> accessing the getters (which are already public) to allow custom post 
> processing of the query instances.




[jira] [Updated] (LUCENE-7928) Change LatLonPointDistanceQuery and LatLonPointInPolygonQuery visibility to public

2017-08-11 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7928:
--
Attachment: LUCENE_7928.patch

> Change LatLonPointDistanceQuery and LatLonPointInPolygonQuery visibility to 
> public
> --
>
> Key: LUCENE-7928
> URL: https://issues.apache.org/jira/browse/LUCENE-7928
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Fix For: master (8.0), 7.1
>
> Attachments: LUCENE_7928.patch
>
>
> Changing the visibility of these classes to public can be useful for 
> accessing the getters (which are already public) to allow custom post 
> processing of the query instances.




[jira] [Created] (LUCENE-7928) Change LatLonPointDistanceQuery and LatLonPointInPolygonQuery visibility to public

2017-08-11 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7928:
-

 Summary: Change LatLonPointDistanceQuery and 
LatLonPointInPolygonQuery visibility to public
 Key: LUCENE-7928
 URL: https://issues.apache.org/jira/browse/LUCENE-7928
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Martijn van Groningen
Priority: Minor
 Fix For: master (8.0), 7.1


Changing the visibility of these classes to public can be useful for accessing 
the getters (which are already public) to allow custom post processing of the 
query instances.




[jira] [Commented] (LUCENE-7889) Allow grouping on DoubleValuesSource ranges

2017-07-03 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16072431#comment-16072431
 ] 

Martijn van Groningen commented on LUCENE-7889:
---

Nice addition! Agreed that it lacks javadocs and more thorough tests. Maybe 
also document this new way of grouping in {{package-info.java}}?

> Allow grouping on DoubleValuesSource ranges
> ---
>
> Key: LUCENE-7889
> URL: https://issues.apache.org/jira/browse/LUCENE-7889
> Project: Lucene - Core
>  Issue Type: New Feature
>Affects Versions: master (7.0)
>Reporter: Alan Woodward
>Assignee: Alan Woodward
> Attachments: LUCENE-7889.patch
>
>
> LUCENE-7701 made it easier to define new ways of grouping results.  This 
> issue adds functionality to group the values of a DoubleValuesSource into a 
> set of ranges.




[jira] [Resolved] (LUCENE-7890) MemoryIndex should allow doc values iterator to be reset to the current docid

2017-06-29 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7890.
---
Resolution: Fixed

Pushed to master branch. I did not add an entry to CHANGES.txt, because this 
bug only existed in 7.0, which hasn't been released yet.

> MemoryIndex should allow doc values iterator to be reset to the current docid
> -
>
> Key: LUCENE-7890
> URL: https://issues.apache.org/jira/browse/LUCENE-7890
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: master (7.0)
>Reporter: Martijn van Groningen
> Attachments: LUCENE-7890.patch
>
>
> The `SortedSetDocValues` and `SortedNumericDocValues` instances returned by 
> the MemoryIndex should support subsequent `advanceExact(0)` invocations.




[jira] [Updated] (LUCENE-7890) MemoryIndex should allow doc values iterator to be reset to the current docid

2017-06-28 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7890:
--
Attachment: LUCENE-7890.patch

Attached patch with fix and a test.

> MemoryIndex should allow doc values iterator to be reset to the current docid
> -
>
> Key: LUCENE-7890
> URL: https://issues.apache.org/jira/browse/LUCENE-7890
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: master (7.0)
>Reporter: Martijn van Groningen
> Attachments: LUCENE-7890.patch
>
>
> The `SortedSetDocValues` and `SortedNumericDocValues` instances returned by 
> the MemoryIndex should support subsequent `advanceExact(0)` invocations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Created] (LUCENE-7890) MemoryIndex should allow doc values iterator to be reset to the current docid

2017-06-28 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7890:
-

 Summary: MemoryIndex should allow doc values iterator to be reset 
to the current docid
 Key: LUCENE-7890
 URL: https://issues.apache.org/jira/browse/LUCENE-7890
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: master (7.0)
Reporter: Martijn van Groningen


The `SortedSetDocValues` and `SortedNumericDocValues` instances returned by the 
MemoryIndex should support subsequent `advanceExact(0)` invocations.
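
One way to read this requirement, as a minimal sketch: a single-document, multi-valued iterator whose value cursor is reset every time `advanceExact(0)` is called, so the values can be read more than once. `SingleDocSortedNumeric` and its fields are hypothetical names invented for this illustration; this is not the MemoryIndex implementation and makes no claim about how the attached patch works.

```java
/** Hypothetical single-document multi-valued iterator; not the MemoryIndex implementation. */
final class SingleDocSortedNumeric {
  private final long[] sortedValues; // values of the only document (doc id 0)
  private int cursor;                // index of the next value to hand out

  SingleDocSortedNumeric(long[] sortedValues) {
    this.sortedValues = sortedValues;
  }

  /** Positions on the target doc and resets the value cursor so values can be re-read. */
  boolean advanceExact(int target) {
    cursor = 0; // the reset that makes repeated advanceExact(0) calls safe
    return target == 0 && sortedValues.length > 0;
  }

  int docValueCount() {
    return sortedValues.length;
  }

  long nextValue() {
    return sortedValues[cursor++];
  }

  /** Reads all values of doc 0 after (re)positioning on it. */
  static long sum(SingleDocSortedNumeric dv) {
    long total = 0;
    if (dv.advanceExact(0)) {
      for (int i = 0; i < dv.docValueCount(); i++) {
        total += dv.nextValue();
      }
    }
    return total;
  }
}
```

Without the reset in `advanceExact`, a second call to `sum` on the same instance would read past the end of the array.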




[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2017-06-22 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE-7304.patch

Updated the patch. Added more tests and cleaned up a bit.

To reiterate what this patch does: the query uses both an indexed field and a 
doc values field. The doc values field is used when 
{{DocIdSetIterator#advance(...)}} is invoked, to find the first child of a 
parent and then instruct the child iterator to advance to that first child. The 
doc values field serves roughly the same purpose as the {{BitSet}} does for 
{{ToParentBlockJoinQuery}}. The indexed field is used for normal forward 
advancing ({{DocIdSetIterator#nextDoc()}}).

I'm still unsure if this query should also use a doc values field for forward 
advancing. Each child would then store the offset to the next child. The last 
child's offset would be zero, meaning the parent is the next document. I think 
the upside of using only doc values fields is that it becomes easier to 
validate that the docid block structure is correct.
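
The offset idea from the issue description can be sketched with plain arrays. This is an illustration only, not the patch: `BlockOffsets`, its fields and the `isParent` marker are hypothetical, and the layout (children indexed first, the parent closing the block) is an assumption of the sketch.

```java
/** Hypothetical offset encoding for document blocks; illustrates the idea only, not the patch. */
final class BlockOffsets {
  // offsets[doc] for a child: distance forward to its parent.
  // offsets[doc] for a parent: distance back to its first child (0 if it has no children).
  private final int[] offsets;
  private final boolean[] isParent;

  /** Builds consecutive blocks; childrenPerParent[i] is the number of children in block i. */
  BlockOffsets(int[] childrenPerParent) {
    int maxDoc = 0;
    for (int n : childrenPerParent) maxDoc += n + 1;
    offsets = new int[maxDoc];
    isParent = new boolean[maxDoc];
    int doc = 0;
    for (int n : childrenPerParent) {
      int parent = doc + n; // children come first, the parent closes the block
      for (int child = doc; child < parent; child++) {
        offsets[child] = parent - child;
      }
      offsets[parent] = n;
      isParent[parent] = true;
      doc = parent + 1;
    }
  }

  /** Returns the parent doc id of the block that contains doc. */
  int parentOf(int doc) {
    return isParent[doc] ? doc : doc + offsets[doc];
  }

  /** Returns the first child of a parent, or the parent itself when it has no children. */
  int firstChildOf(int parent) {
    return parent - offsets[parent];
  }
}
```

Both lookups are a single array read, which is the property that would let a per-document numeric field stand in for the parent bitset in this sketch.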

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch, 
> LUCENE-7304.patch, LUCENE-7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, because of the way 
> these bitsets are populated, the 'FixedBitSet' implementation is typically used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7571) TestJoinUtil.testSingleValueRandomJoin() failure

2017-06-19 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053727#comment-16053727
 ] 

Martijn van Groningen commented on LUCENE-7571:
---

The test failed while doing a numeric join on a floating-point join field. 
Because of precision loss when converting an integer to a float during indexing, 
more documents matched than expected when testing the numeric join.
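
The precision loss can be reproduced with plain Java, independent of Lucene. A `float` has a 24-bit significand, so not every `int` above 2^24 has its own `float` value. The class and method names below are hypothetical and exist only for this illustration.

```java
/** Shows that distinct ints can collapse to the same float once they exceed 2^24. */
final class FloatPrecisionDemo {
  /** True when the two ints become indistinguishable after conversion to float. */
  static boolean collideAsFloat(int a, int b) {
    return (float) a == (float) b;
  }

  /** Counts how many candidates equal the target after both sides are converted to float. */
  static int matchesAsFloat(int target, int[] candidates) {
    int count = 0;
    for (int c : candidates) {
      if (collideAsFloat(target, c)) {
        count++;
      }
    }
    return count;
  }
}
```

For example, 16777216 and 16777217 convert to the same `float`, so a join keyed on the float value would match both where an integer join would match only one.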

> TestJoinUtil.testSingleValueRandomJoin() failure
> 
>
> Key: LUCENE-7571
> URL: https://issues.apache.org/jira/browse/LUCENE-7571
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/join
>Reporter: Steve Rowe
>Assignee: Martijn van Groningen
>
> My Jenkins found a reproducing branch_6x seed:
> {noformat}
> Checking out Revision 500f6c7875be31c34ca68c58f850b7e64fd049a9 
> (refs/remotes/origin/branch_6x)
> [...]
>[junit4] Suite: org.apache.lucene.search.join.TestJoinUtil
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestJoinUtil 
> -Dtests.method=testSingleValueRandomJoin -Dtests.seed=D50603847B355BCB 
> -Dtests.slow=true -Dtests.locale=sq -Dtests.timezone=America/Indianapolis 
> -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
>[junit4] FAILURE 1.42s J0 | TestJoinUtil.testSingleValueRandomJoin <<<
>[junit4]> Throwable #1: java.lang.AssertionError: 
> expected: but 
> was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([D50603847B355BCB:39BE12CEF3F64714]:0)
>[junit4]>  at 
> org.apache.lucene.search.join.TestJoinUtil.assertBitSet(TestJoinUtil.java:1046)
>[junit4]>  at 
> org.apache.lucene.search.join.TestJoinUtil.executeRandomJoin(TestJoinUtil.java:1023)
>[junit4]>  at 
> org.apache.lucene.search.join.TestJoinUtil.testSingleValueRandomJoin(TestJoinUtil.java:938)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene62), 
> sim=RandomSimilarity(queryNorm=false,coord=yes): 
> {productId=ClassicSimilarity, field=DFR I(F)2, price=DFR GB1, subtitle=DFR 
> I(n)L3(800.0), name=DFR G1, description=DFR GL2, from=DFR GB2, movieId=IB 
> LL-L2, id=DFR I(ne)L1, to=DFR I(ne)BZ(0.3), type=DFR I(n)L2, value=DFR 
> I(ne)2}, locale=sq, timezone=America/Indianapolis
>[junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 
> 1.8.0_77 (64-bit)/cpus=16,threads=1,free=495200800,total=522715136
>[junit4]   2> NOTE: All tests run in this JVM: [TestJoinUtil]
>[junit4] Completed [5/6 (1!)] on J0 in 9.20s, 13 tests, 1 failure <<< 
> FAILURES!
> {noformat}




[jira] [Updated] (LUCENE-7869) MemoryIndex should sort 1d points

2017-06-08 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7869:
--
Fix Version/s: 6.6.1

> MemoryIndex should sort 1d points
> -
>
> Key: LUCENE-7869
> URL: https://issues.apache.org/jira/browse/LUCENE-7869
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
> Fix For: master (7.0), 6.7, 6.6.1
>
> Attachments: LUCENE_7869.patch
>
>
> In the case of 1d points, the {{PointInSetQuery.MergePointVisitor}} expects 
> these points to be visited in ascending order. The memory index doesn't do this, 
> which can cause a document with multiple points that should match to not 
> match.




[jira] [Resolved] (LUCENE-7869) MemoryIndex should sort 1d points

2017-06-08 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7869.
---
Resolution: Fixed

> MemoryIndex should sort 1d points
> -
>
> Key: LUCENE-7869
> URL: https://issues.apache.org/jira/browse/LUCENE-7869
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
> Fix For: master (7.0), 6.7
>
> Attachments: LUCENE_7869.patch
>
>
> In the case of 1d points, the {{PointInSetQuery.MergePointVisitor}} expects 
> these points to be visited in ascending order. The memory index doesn't do this, 
> which can cause a document with multiple points that should match to not 
> match.




[jira] [Updated] (LUCENE-7869) MemoryIndex should sort 1d points

2017-06-08 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7869:
--
Fix Version/s: 6.7
   master (7.0)

> MemoryIndex should sort 1d points
> -
>
> Key: LUCENE-7869
> URL: https://issues.apache.org/jira/browse/LUCENE-7869
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
> Fix For: master (7.0), 6.7
>
> Attachments: LUCENE_7869.patch
>
>
> In the case of 1d points, the {{PointInSetQuery.MergePointVisitor}} expects 
> these points to be visited in ascending order. The memory index doesn't do this, 
> which can cause a document with multiple points that should match to not 
> match.




[jira] [Assigned] (LUCENE-7571) TestJoinUtil.testSingleValueRandomJoin() failure

2017-06-07 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen reassigned LUCENE-7571:
-

Assignee: Martijn van Groningen

> TestJoinUtil.testSingleValueRandomJoin() failure
> 
>
> Key: LUCENE-7571
> URL: https://issues.apache.org/jira/browse/LUCENE-7571
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/join
>Reporter: Steve Rowe
>Assignee: Martijn van Groningen
>
> My Jenkins found a reproducing branch_6x seed:
> {noformat}
> Checking out Revision 500f6c7875be31c34ca68c58f850b7e64fd049a9 
> (refs/remotes/origin/branch_6x)
> [...]
>[junit4] Suite: org.apache.lucene.search.join.TestJoinUtil
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestJoinUtil 
> -Dtests.method=testSingleValueRandomJoin -Dtests.seed=D50603847B355BCB 
> -Dtests.slow=true -Dtests.locale=sq -Dtests.timezone=America/Indianapolis 
> -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
>[junit4] FAILURE 1.42s J0 | TestJoinUtil.testSingleValueRandomJoin <<<
>[junit4]> Throwable #1: java.lang.AssertionError: 
> expected: but 
> was:
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([D50603847B355BCB:39BE12CEF3F64714]:0)
>[junit4]>  at 
> org.apache.lucene.search.join.TestJoinUtil.assertBitSet(TestJoinUtil.java:1046)
>[junit4]>  at 
> org.apache.lucene.search.join.TestJoinUtil.executeRandomJoin(TestJoinUtil.java:1023)
>[junit4]>  at 
> org.apache.lucene.search.join.TestJoinUtil.testSingleValueRandomJoin(TestJoinUtil.java:938)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene62), 
> sim=RandomSimilarity(queryNorm=false,coord=yes): 
> {productId=ClassicSimilarity, field=DFR I(F)2, price=DFR GB1, subtitle=DFR 
> I(n)L3(800.0), name=DFR G1, description=DFR GL2, from=DFR GB2, movieId=IB 
> LL-L2, id=DFR I(ne)L1, to=DFR I(ne)BZ(0.3), type=DFR I(n)L2, value=DFR 
> I(ne)2}, locale=sq, timezone=America/Indianapolis
>[junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 
> 1.8.0_77 (64-bit)/cpus=16,threads=1,free=495200800,total=522715136
>[junit4]   2> NOTE: All tests run in this JVM: [TestJoinUtil]
>[junit4] Completed [5/6 (1!)] on J0 in 9.20s, 13 tests, 1 failure <<< 
> FAILURES!
> {noformat}




[jira] [Updated] (LUCENE-7869) MemoryIndex should sort 1d points

2017-06-07 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7869:
--
Attachment: LUCENE_7869.patch

Attached a patch with a fix and a test.

> MemoryIndex should sort 1d points
> -
>
> Key: LUCENE-7869
> URL: https://issues.apache.org/jira/browse/LUCENE-7869
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
> Attachments: LUCENE_7869.patch
>
>
> In the case of 1d points, the {{PointInSetQuery.MergePointVisitor}} expects 
> these points to be visited in ascending order. The memory index doesn't do this, 
> which can cause a document with multiple points that should match to not 
> match.




[jira] [Created] (LUCENE-7869) MemoryIndex should sort 1d points

2017-06-07 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7869:
-

 Summary: MemoryIndex should sort 1d points
 Key: LUCENE-7869
 URL: https://issues.apache.org/jira/browse/LUCENE-7869
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Martijn van Groningen


In the case of 1d points, the {{PointInSetQuery.MergePointVisitor}} expects 
these points to be visited in ascending order. The memory index doesn't do this, 
which can cause a document with multiple points that should match to not 
match.
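
Why visit order matters can be shown with a simplified model of a merge-style matcher. `MergeMatcher` is a hypothetical class written for this illustration: it works on `long` values with a forward-only cursor and does not reproduce the real `PointInSetQuery.MergePointVisitor`.

```java
import java.util.Arrays;

/** Simplified model of a merge-style matcher over sorted query points (not Lucene code). */
final class MergeMatcher {
  private final long[] sortedQueryPoints;
  private int cursor; // only moves forward, which is why visit order matters

  MergeMatcher(long[] queryPoints) {
    sortedQueryPoints = queryPoints.clone();
    Arrays.sort(sortedQueryPoints);
  }

  /** Returns true if value equals a query point at or after the cursor. */
  boolean visit(long value) {
    while (cursor < sortedQueryPoints.length && sortedQueryPoints[cursor] < value) {
      cursor++;
    }
    return cursor < sortedQueryPoints.length && sortedQueryPoints[cursor] == value;
  }

  /** Visits the document's points in the given order and reports whether any matched. */
  static boolean matches(long[] queryPoints, long[] docPoints) {
    MergeMatcher matcher = new MergeMatcher(queryPoints);
    boolean matched = false;
    for (long p : docPoints) {
      matched |= matcher.visit(p);
    }
    return matched;
  }
}
```

In this model a document with points 9 and 5 misses the query point 5 when 9 is visited first, because the cursor has already moved past 5; visiting the same points in ascending order finds the match.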




[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2017-05-24 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE-7304.patch

It has been a while, but I had some time to get back to this. Updated the patch 
to account for all the changes that have happened so far in master 
(iterator-based doc values API, two-phase query execution and score supplier).

I ran the same performance test as before and, thanks to doc values compression, 
the offset field now takes 337387 bytes instead of the previous 839592 bytes, 
which is good!

I'm still thinking about other ways of encoding the block of documents. Right 
now the parent documents have a doc values field with the offset to the first 
child docid. Instead, child documents could have a doc values field with the 
offset to their parent docid. That way the parent doc can be indexed before the 
child docs.



> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch, 
> LUCENE-7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, because of the way 
> these bitsets are populated, the 'FixedBitSet' implementation is typically used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7810) false positive equality: distinctly diff join queries return equals()==true

2017-05-23 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021681#comment-16021681
 ] 

Martijn van Groningen commented on LUCENE-7810:
---

The change has now been backported to the 6.6 branch too.

> false positive equality: distinctly diff join queries return equals()==true
> ---
>
> Key: LUCENE-7810
> URL: https://issues.apache.org/jira/browse/LUCENE-7810
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
> Fix For: master (7.0), 6.6, 6.7
>
> Attachments: LUCENE_7810.patch, LUCENE_7810.patch, LUCENE-7810.patch
>
>
> While working on SOLR-10583 I was getting some odd test failures that seemed 
> to suggest we were getting false cache hits for Join queries that should have 
> been unique.
> Tracing through the code, the problem seems to be the way {{TermsQuery}} 
> implements {{equals(Object)}}.  This class takes in the {{fromQuery}} (used 
> to identify set of documents we "join from") and uses it in the equals 
> calculation -- but the information about the join _field_ is never passed 
> directly to {{TermsQuery}} and the BytesRefs that are passed in can't be 
> compared efficiently (AFAICT), so 2 completely diff calls to 
> {{JoinUtils.createJoinQuery(...)}} can result in Query objects that think 
> they are {{equal()}} even when they most certainly are not.
> At a brief glance, it appears that similar bugs exist in 
> {{TermsIncludingScoreQuery}} (and possibly {{GlobalOrdinalsWithScoreQuery}}, 
> but i didn't look into that class at all)




[jira] [Updated] (LUCENE-7810) false positive equality: distinctly diff join queries return equals()==true

2017-05-23 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7810:
--
Fix Version/s: 6.6

> false positive equality: distinctly diff join queries return equals()==true
> ---
>
> Key: LUCENE-7810
> URL: https://issues.apache.org/jira/browse/LUCENE-7810
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
> Fix For: master (7.0), 6.6, 6.7
>
> Attachments: LUCENE_7810.patch, LUCENE_7810.patch, LUCENE-7810.patch
>
>
> While working on SOLR-10583 I was getting some odd test failures that seemed 
> to suggest we were getting false cache hits for Join queries that should have 
> been unique.
> Tracing through the code, the problem seems to be the way {{TermsQuery}} 
> implements {{equals(Object)}}.  This class takes in the {{fromQuery}} (used 
> to identify set of documents we "join from") and uses it in the equals 
> calculation -- but the information about the join _field_ is never passed 
> directly to {{TermsQuery}} and the BytesRefs that are passed in can't be 
> compared efficiently (AFAICT), so 2 completely diff calls to 
> {{JoinUtils.createJoinQuery(...)}} can result in Query objects that think 
> they are {{equal()}} even when they most certainly are not.
> At a brief glance, it appears that similar bugs exist in 
> {{TermsIncludingScoreQuery}} (and possibly {{GlobalOrdinalsWithScoreQuery}}, 
> but i didn't look into that class at all)




[jira] [Resolved] (LUCENE-7810) false positive equality: distinctly diff join queries return equals()==true

2017-05-23 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7810.
---
   Resolution: Fixed
Fix Version/s: 6.7
   master (7.0)

> false positive equality: distinctly diff join queries return equals()==true
> ---
>
> Key: LUCENE-7810
> URL: https://issues.apache.org/jira/browse/LUCENE-7810
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
> Fix For: master (7.0), 6.7
>
> Attachments: LUCENE_7810.patch, LUCENE_7810.patch, LUCENE-7810.patch
>
>
> While working on SOLR-10583 I was getting some odd test failures that seemed 
> to suggest we were getting false cache hits for Join queries that should have 
> been unique.
> Tracing through the code, the problem seems to be the way {{TermsQuery}} 
> implements {{equals(Object)}}.  This class takes in the {{fromQuery}} (used 
> to identify set of documents we "join from") and uses it in the equals 
> calculation -- but the information about the join _field_ is never passed 
> directly to {{TermsQuery}} and the BytesRefs that are passed in can't be 
> compared efficiently (AFAICT), so 2 completely diff calls to 
> {{JoinUtils.createJoinQuery(...)}} can result in Query objects that think 
> they are {{equal()}} even when they most certainly are not.
> At a brief glance, it appears that similar bugs exist in 
> {{TermsIncludingScoreQuery}} (and possibly {{GlobalOrdinalsWithScoreQuery}}, 
> but i didn't look into that class at all)




[jira] [Updated] (LUCENE-7810) false positive equality: distinctly diff join queries return equals()==true

2017-05-19 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7810:
--
Attachment: LUCENE_7810.patch

[~jpountz] I've updated the patch. Score mode is now taken into account in the
equals(...) and hashCode(...) methods, and if a scoring query is used when no
scores are needed, the query gets replaced with the non-scoring variant.
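
As an illustration of the equality fix, the sketch below shows a query-like class whose equals(...) and hashCode(...) cover the join fields and the score mode as well as the from query. Everything here is an assumption made for the example: the class, its field names and the enum are hypothetical stand-ins, not the actual {{TermsQuery}} or {{TermsIncludingScoreQuery}} code.

```java
import java.util.Objects;

/** Hypothetical stand-in for a join query; not the actual Lucene class. */
final class JoinQuerySketch {

    /** Illustrative score modes; the names are assumptions for this sketch. */
    enum ScoreMode { NONE, AVG, MAX, TOTAL, MIN }

    private final String toField;
    private final String fromField;
    private final Object fromQuery;
    private final ScoreMode scoreMode;

    JoinQuerySketch(String toField, String fromField, Object fromQuery, ScoreMode scoreMode) {
        this.toField = Objects.requireNonNull(toField);
        this.fromField = Objects.requireNonNull(fromField);
        this.fromQuery = Objects.requireNonNull(fromQuery);
        this.scoreMode = Objects.requireNonNull(scoreMode);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof JoinQuerySketch)) {
            return false;
        }
        JoinQuerySketch that = (JoinQuerySketch) other;
        // Comparing only fromQuery would make joins on different fields look equal.
        return toField.equals(that.toField)
            && fromField.equals(that.fromField)
            && fromQuery.equals(that.fromQuery)
            && scoreMode == that.scoreMode;
    }

    @Override
    public int hashCode() {
        // Must use the same components as equals(...) to keep the contract intact.
        return Objects.hash(toField, fromField, fromQuery, scoreMode);
    }
}
```

With a key like this, two joins that share a from query but differ in join field or score mode can no longer collide in a query cache.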

> false positive equality: distinctly diff join queries return equals()==true
> ---
>
> Key: LUCENE-7810
> URL: https://issues.apache.org/jira/browse/LUCENE-7810
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
> Attachments: LUCENE_7810.patch, LUCENE_7810.patch, LUCENE-7810.patch
>
>
> While working on SOLR-10583 I was getting some odd test failures that seemed 
> to suggest we were getting false cache hits for Join queries that should have 
> been unique.
> Tracing through the code, the problem seems to be the way {{TermsQuery}} 
> implements {{equals(Object)}}.  This class takes in the {{fromQuery}} (used 
> to identify set of documents we "join from") and uses it in the equals 
> calculation -- but the information about the join _field_ is never passed 
> directly to {{TermsQuery}} and the BytesRefs that are passed in can't be 
> compared efficiently (AFAICT), so 2 completely diff calls to 
> {{JoinUtils.createJoinQuery(...)}} can result in Query objects that think 
> they are {{equal()}} even when they most certainly are not.
> At a brief glance, it appears that similar bugs exist in 
> {{TermsIncludingScoreQuery}} (and possibly {{GlobalOrdinalsWithScoreQuery}}, 
> but i didn't look into that class at all)




[jira] [Commented] (LUCENE-7810) false positive equality: distinctly diff join queries return equals()==true

2017-05-17 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16014103#comment-16014103
 ] 

Martijn van Groningen commented on LUCENE-7810:
---

> If we want to be able to reuse cache entries that have different score 
> modes, we could rewrite to a TermsQuery in createWeight, similarly to how 
> BooleanQuery rewrites all MUST clauses into FILTER clauses when needsScores 
> is false?

Good idea. I'll make this change.

> false positive equality: distinctly diff join queries return equals()==true
> ---
>
> Key: LUCENE-7810
> URL: https://issues.apache.org/jira/browse/LUCENE-7810
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
> Attachments: LUCENE_7810.patch, LUCENE-7810.patch
>
>
> While working on SOLR-10583 I was getting some odd test failures that seemed 
> to suggest we were getting false cache hits for Join queries that should have 
> been unique.
> Tracing through the code, the problem seems to be the way {{TermsQuery}} 
> implements {{equals(Object)}}.  This class takes in the {{fromQuery}} (used 
> to identify set of documents we "join from") and uses it in the equals 
> calculation -- but the information about the join _field_ is never passed 
> directly to {{TermsQuery}} and the BytesRefs that are passed in can't be 
> compared efficiently (AFAICT), so 2 completely diff calls to 
> {{JoinUtils.createJoinQuery(...)}} can result in Query objects that think 
> they are {{equal()}} even when they most certainly are not.
> At a brief glance, it appears that similar bugs exist in 
> {{TermsIncludingScoreQuery}} (and possibly {{GlobalOrdinalsWithScoreQuery}}, 
> but i didn't look into that class at all)




[jira] [Updated] (LUCENE-7810) false positive equality: distinctly diff join queries return equals()==true

2017-05-15 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7810:
--
Attachment: LUCENE_7810.patch

Added a patch based on [~hossman]'s patch that adds more tests and fixes the
{{equals()}} method in {{TermsIncludingScoreQuery}} and {{TermsQuery}}. The
global-ordinal-based queries already implemented {{equals()}} correctly, and
the numeric join's {{equals()}} method was also working correctly because it
compares the actual collected points.

> false positive equality: distinctly diff join queries return equals()==true
> ---
>
> Key: LUCENE-7810
> URL: https://issues.apache.org/jira/browse/LUCENE-7810
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
> Attachments: LUCENE_7810.patch, LUCENE-7810.patch
>
>
> While working on SOLR-10583 I was getting some odd test failures that seemed 
> to suggest we were getting false cache hits for Join queries that should have 
> been unique.
> Tracing through the code, the problem seems to be the way {{TermsQuery}} 
> implements {{equals(Object)}}.  This class takes in the {{fromQuery}} (used 
> to identify set of documents we "join from") and uses it in the equals 
> calculation -- but the information about the join _field_ is never passed 
> directly to {{TermsQuery}} and the BytesRefs that are passed in can't be 
> compared efficiently (AFAICT), so 2 completely diff calls to 
> {{JoinUtils.createJoinQuery(...)}} can result in Query objects that think 
> they are {{equal()}} even when they most certainly are not.
> At a brief glance, it appears that similar bugs exist in 
> {{TermsIncludingScoreQuery}} (and possibly {{GlobalOrdinalsWithScoreQuery}}, 
> but i didn't look into that class at all)




[jira] [Commented] (LUCENE-7810) false positive equality: distinctly diff join queries return equals()==true

2017-05-12 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008212#comment-16008212
 ] 

Martijn van Groningen commented on LUCENE-7810:
---

Good catch [~hossman]! If nobody is working on this then I can fix this bug. So 
far only the {{TermsQuery}} seems to not take into account the join field.

bq. you mean the from query, correct?

I think this is what [~jpountz] means, because equality checking the collected 
terms would be too expensive. 

I think the {{TermsIncludingScoreQuery}}, {{TermsQuery}},
{{PointInSetIncludingScoreQuery}} and {{PointInSetQuery}} should also take the
index reader context id into account, like {{GlobalOrdinalsQuery}} does.
Otherwise the fromQuery + join field key can still be invalid (mainly when
docs are added on the from side).

> false positive equality: distinctly diff join queries return equals()==true
> ---
>
> Key: LUCENE-7810
> URL: https://issues.apache.org/jira/browse/LUCENE-7810
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Hoss Man
> Attachments: LUCENE-7810.patch
>
>
> While working on SOLR-10583 I was getting some odd test failures that seemed 
> to suggest we were getting false cache hits for Join queries that should have 
> been unique.
> Tracing through the code, the problem seems to be the way {{TermsQuery}} 
> implements {{equals(Object)}}.  This class takes in the {{fromQuery}} (used 
> to identify set of documents we "join from") and uses it in the equals 
> calculation -- but the information about the join _field_ is never passed 
> directly to {{TermsQuery}} and the BytesRefs that are passed in can't be 
> compared efficiently (AFAICT), so 2 completely diff calls to 
> {{JoinUtils.createJoinQuery(...)}} can result in Query objects that think 
> they are {{equal()}} even when they most certainly are not.
> At a brief glance, it appears that similar bugs exist in 
> {{TermsIncludingScoreQuery}} (and possibly {{GlobalOrdinalsWithScoreQuery}}, 
> but i didn't look into that class at all)




[jira] [Commented] (LUCENE-7798) add equals/hashCode to ToParentBlockJoinSortField

2017-04-28 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15988351#comment-15988351
 ] 

Martijn van Groningen commented on LUCENE-7798:
---

+1

> add equals/hashCode to ToParentBlockJoinSortField
> -
>
> Key: LUCENE-7798
> URL: https://issues.apache.org/jira/browse/LUCENE-7798
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/join
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
> Attachments: LUCENE-7798.patch
>
>
> Since SOLR-10521, {{ToParentBlockJoinSortField}} is going to be used as a query
> result key, so it is worth implementing proper equality methods.




[jira] [Commented] (LUCENE-7701) Refactor grouping collectors

2017-04-07 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960652#comment-15960652
 ] 

Martijn van Groningen commented on LUCENE-7701:
---

+1 I think this is a good change. I agree it should be 7.0 only.

> Refactor grouping collectors
> 
>
> Key: LUCENE-7701
> URL: https://issues.apache.org/jira/browse/LUCENE-7701
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
> Attachments: LUCENE-7701.patch, LUCENE-7701.patch
>
>
> Grouping currently works via abstract collectors, which need to be overridden 
> for each way of defining a group - currently we have two, 'term' (based on 
> SortedDocValues) and 'function' (based on ValueSources).  These collectors
> all have a lot of repeated code, which means that if you want to implement your
> own group definitions, you need to override four or five different classes.
> This would be easier to deal with if instead the 'group selection' code was 
> abstracted out into a single interface, and the various collectors were 
> changed to concrete implementations.
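
The shape of the proposed refactoring can be sketched without any Lucene types. The interface and collector below are hypothetical names chosen for this example and do not reproduce the classes in the attached patch. The point is only that one concrete collector can serve any group definition once "group selection" sits behind a single interface.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of the refactoring idea; not the patch's API. */
final class GroupingSketch {

    /** The single "group selection" abstraction: maps a document to its group value. */
    interface GroupSelectorSketch<D, G> {
        G groupOf(D doc);
    }

    /** A concrete collector that no longer needs one subclass per group definition. */
    static final class GroupCountCollector<D, G> {
        private final GroupSelectorSketch<D, G> selector;
        private final Map<G, Integer> counts = new HashMap<>();

        GroupCountCollector(GroupSelectorSketch<D, G> selector) {
            this.selector = selector;
        }

        /** Counts the document under whatever group the selector assigns to it. */
        void collect(D doc) {
            counts.merge(selector.groupOf(doc), 1, Integer::sum);
        }

        Map<G, Integer> counts() {
            return counts;
        }
    }
}
```

A new way of defining groups then means implementing one small interface rather than overriding several abstract collectors.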




[jira] [Commented] (LUCENE-7701) Refactor grouping collectors

2017-03-31 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950977#comment-15950977
 ] 

Martijn van Groningen commented on LUCENE-7701:
---

Sorry for the late reply [~romseygeek]! In general I agree with this
refactoring. It is a better design. I'll look more closely at this patch next
week.

> Refactor grouping collectors
> 
>
> Key: LUCENE-7701
> URL: https://issues.apache.org/jira/browse/LUCENE-7701
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
> Attachments: LUCENE-7701.patch, LUCENE-7701.patch
>
>
> Grouping currently works via abstract collectors, which need to be overridden 
> for each way of defining a group - currently we have two, 'term' (based on 
> SortedDocValues) and 'function' (based on ValueSources).  These collectors
> all have a lot of repeated code, which means that if you want to implement your
> own group definitions, you need to override four or five different classes.
> This would be easier to deal with if instead the 'group selection' code was 
> abstracted out into a single interface, and the various collectors were 
> changed to concrete implementations.




[jira] [Commented] (LUCENE-7755) Join queries should not reference IndexReaders.

2017-03-28 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945343#comment-15945343
 ] 

Martijn van Groningen commented on LUCENE-7755:
---

+1 good catch!

> Join queries should not reference IndexReaders.
> ---
>
> Key: LUCENE-7755
> URL: https://issues.apache.org/jira/browse/LUCENE-7755
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
> Attachments: LUCENE-7755.patch
>
>
> This is similar to LUCENE-7657 and can cause memory leaks when those queries 
> are cached.




[jira] [Commented] (LUCENE-7681) Remove LegacyDocValues implementations from MemoryIndex

2017-02-10 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861231#comment-15861231
 ] 

Martijn van Groningen commented on LUCENE-7681:
---

+1 looks good!

> Remove LegacyDocValues implementations from MemoryIndex
> ---
>
> Key: LUCENE-7681
> URL: https://issues.apache.org/jira/browse/LUCENE-7681
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: master (7.0)
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Minor
> Attachments: LUCENE-7681.patch
>
>
> MemoryIndex in master is using the LegacyDocValue wrappers.  We should 
> replace these with plain 7.0-style iterators instead.




[jira] [Commented] (LUCENE-7685) Remove equals/rewrite hacks from block join queries

2017-02-10 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861219#comment-15861219
 ] 

Martijn van Groningen commented on LUCENE-7685:
---

+1

> Remove equals/rewrite hacks from block join queries
> ---
>
> Key: LUCENE-7685
> URL: https://issues.apache.org/jira/browse/LUCENE-7685
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7685.patch
>
>
> These queries try to ensure that rewritten queries are equal to the original 
> query by keeping around the original query that was used to instantiate the 
> join query. However, this does not buy anything, and could even prevent two
> queries that rewrite to the same form from being considered equal.




[jira] [Commented] (LUCENE-7684) MemoryIndex should store payloads per-field

2017-02-09 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859688#comment-15859688
 ] 

Martijn van Groningen commented on LUCENE-7684:
---

+1 looks good!

> MemoryIndex should store payloads per-field
> ---
>
> Key: LUCENE-7684
> URL: https://issues.apache.org/jira/browse/LUCENE-7684
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
> Attachments: LUCENE-7684.patch
>
>
> Currently MemoryIndex will store payloads for all fields, or for none.  It 
> would be useful instead for it to store them per-field.




[jira] [Commented] (LUCENE-7679) MemoryIndex.addField() ignores some FieldType settings

2017-02-07 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856116#comment-15856116
 ] 

Martijn van Groningen commented on LUCENE-7679:
---

+1

> MemoryIndex.addField() ignores some FieldType settings
> --
>
> Key: LUCENE-7679
> URL: https://issues.apache.org/jira/browse/LUCENE-7679
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
> Attachments: LUCENE-7679.patch
>
>
> Spotted by a luwak user: https://github.com/flaxsearch/luwak/issues/135.  
> MemoryIndex never omits norms, which means that it can produce incorrect 
> scores.




[jira] (LUCENE-7665) Remove grouping dependency from the join module

2017-01-30 Thread Martijn van Groningen (JIRA)

Martijn van Groningen resolved as Fixed

Lucene - Core / LUCENE-7665
Remove grouping dependency from the join module

Change By: Martijn van Groningen
Resolution: Fixed
Status: Resolved (was: Open)

[jira] (LUCENE-7665) Remove grouping dependency from the join module

2017-01-29 Thread Martijn van Groningen (JIRA)

Martijn van Groningen updated an issue

Lucene - Core / LUCENE-7665
Remove grouping dependency from the join module

Change By: Martijn van Groningen

Follow up from LUCENE-6959.

[jira] (LUCENE-7665) Remove grouping dependency from the join module

2017-01-29 Thread Martijn van Groningen (JIRA)

Martijn van Groningen updated an issue

Lucene - Core / LUCENE-7665
Remove grouping dependency from the join module

Attached patch that removes the grouping dependency from join module.

Change By: Martijn van Groningen
Attachment: LUCENE_7665.patch

[jira] (LUCENE-7665) Remove grouping dependency from the join module

2017-01-29 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created an issue

Lucene - Core / LUCENE-7665
Remove grouping dependency from the join module

Issue Type: Improvement
Assignee: Unassigned
Created: 29/Jan/17 21:54
Fix Versions: master (7.0), 6.5
Priority: Minor
Reporter: Martijn van Groningen

[jira] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-29 Thread Martijn van Groningen (JIRA)

Martijn van Groningen resolved as Fixed

Lucene - Core / LUCENE-6959
Remove ToParentBlockJoinCollector

Change By: Martijn van Groningen
Resolution: Fixed
Status: Resolved (was: Open)

[jira] [Updated] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-28 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-6959:
--
Attachment: LUCENE_6959.patch

[~mikemccand] I added back the child hit checking in TestBlockJoin test suite.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE_6959.patch, 
> LUCENE_6959.patch, LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Commented] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-27 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842806#comment-15842806
 ] 

Martijn van Groningen commented on LUCENE-6959:
---

bq. I can take a crack at putting back some of the child hit checking there, if 
you all haven't started on that yet?

I can add that back.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE_6959.patch, 
> LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Updated] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-27 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-6959:
--
Attachment: LUCENE_6959.patch

I've updated the patch. Thanks for reviewing!

bq. should it take the childQuery into account for equals/hashcode?

Oops, I forgot to add that back when removing `origChildQuery`.

bq. it looks buggy to me that we do not convert parentDocId to 
parentDocId-context.docBase in the scorer?

Good catch. I didn't catch this initially, but after running the test provided
in the patch 100 times it did fail, because the `parentDocId` wasn't converted.

bq. you use ConstantScoreWeight but then return a Scorer that actually scores, 
you should extend Weight directly instead.

Good point, I've changed that.

bq. Let's remove the "we"?

Done.
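
For readers unfamiliar with the docBase point above: a top-level docid has to be turned into a segment-local one before it is compared with docids produced by a per-segment scorer. The sketch below shows that arithmetic on plain arrays. The class and method names are hypothetical, and the docBases array is assumed to be sorted ascending and to start at 0, one entry per non-empty segment.

```java
import java.util.Arrays;

/** Hypothetical illustration of global vs. segment-local docids; not code from the patch. */
final class DocBaseSketch {

    /** Returns the index of the segment that contains the given top-level docid. */
    static int segmentOf(int[] docBases, int globalDoc) {
        int idx = Arrays.binarySearch(docBases, globalDoc);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the owning
        // segment is the one just before the insertion point.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Converts a top-level docid into a docid relative to its own segment. */
    static int toSegmentLocal(int[] docBases, int globalDoc) {
        return globalDoc - docBases[segmentOf(docBases, globalDoc)];
    }
}
```

With a single segment the docBase is 0 and the bug stays hidden, which would explain why a randomized test only fails on some runs.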

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE_6959.patch, 
> LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Updated] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-6959:
--
Attachment: LUCENE_6959.patch

Removed the `origChildQuery` field.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE_6959.patch, 
> LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Commented] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839694#comment-15839694
 ] 

Martijn van Groningen commented on LUCENE-6959:
---

I was wrong, `origChildQuery` can be removed, as rewritten queries are used as 
the cache key in the query cache.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Comment Edited] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839694#comment-15839694
 ] 

Martijn van Groningen edited comment on LUCENE-6959 at 1/26/17 1:51 PM:


I was wrong, `origChildQuery` can be removed, as rewritten queries are used as 
the cache key in the query cache.


was (Author: martijn.v.groningen):
I was wrong, it can `origChildQuery` can be removed as rewritten queries are 
used as cache key in query cache.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Commented] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839635#comment-15839635
 ] 

Martijn van Groningen commented on LUCENE-6959:
---

bq. Do we really need the {{origChildQuery}} anymore?

In case the query gets rewritten and then happens to get cached?

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Comment Edited] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839635#comment-15839635
 ] 

Martijn van Groningen edited comment on LUCENE-6959 at 1/26/17 12:26 PM:
-

bq. Do we really need the {{origChildQuery}} anymore?

In case the query gets rewritten and if the query happens to get cached?


was (Author: martijn.v.groningen):
bq. Do we really need the {{origChildQuery}} anymore?

In case the query get rewritten and if it happened to get cached?

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Commented] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839436#comment-15839436
 ] 

Martijn van Groningen commented on LUCENE-6959:
---

I wonder if we should continue using TopGroups in the TestBlockJoin test case? 
If we stop using it, then we can remove the dependency that the join module 
has on the grouping module.

`TopGroups` has logic for merging other groups, but for block join there 
should be no need for that, as a parent and its children are always in the 
same Lucene index. The merging makes sense for grouping, because there an 
arbitrary document can belong to a group.

Mike, would just using a ScoreDoc with TopDocs be sufficient to represent a 
jira issue with comments in jirasearch?

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Updated] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-6959:
--
Attachment: LUCENE_6959.patch

I've changed the ParentChildrenBlockJoinQuery so that it no longer requires a 
LeafReader; it now figures out by itself which leaf reader the parent doc is 
in.
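
For illustration, a minimal sketch of how a top-level doc ID can be mapped to 
the segment that holds it. It assumes only that each leaf has a `docBase` (the 
first top-level doc ID it covers) and that no leaf is empty, so the doc bases 
are distinct. `LeafLookup` and its method names are hypothetical and not taken 
from the patch; in Lucene itself `ReaderUtil.subIndex(int, List<LeafReaderContext>)` 
does this lookup.

```java
import java.util.Arrays;

/** Hypothetical helper: maps a top-level doc ID to the leaf (segment) that holds it. */
final class LeafLookup {

  /**
   * @param docBases sorted ascending and distinct; docBases[i] is the first
   *                 top-level doc ID of leaf i (assumption of this sketch)
   * @param docId    a top-level doc ID, assumed to be >= docBases[0]
   * @return the index of the leaf that contains docId
   */
  static int leafIndex(int[] docBases, int docId) {
    int pos = Arrays.binarySearch(docBases, docId);
    // On a miss, binarySearch returns -(insertionPoint) - 1; the owning leaf
    // is the one just before the insertion point.
    return pos >= 0 ? pos : -pos - 2;
  }

  /** Converts a top-level doc ID into a doc ID local to its leaf. */
  static int localDocId(int[] docBases, int docId) {
    return docId - docBases[leafIndex(docBases, docId)];
  }
}
```

With this kind of lookup the query only needs the parent's top-level doc ID, 
and a scorer for any other segment can simply match nothing.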

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Commented] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839408#comment-15839408
 ] 

Martijn van Groningen commented on LUCENE-6959:
---

bq. Can we maybe change the new query to instead hold the parent's docID in the 
top-level reader's space, and then in the scorer method, check the incoming 
reader context to see if this is the segment that holds the parent? This would 
also simplify usage, so users wouldn't have to create their own weights? Then I 
think you don't need the LeafReader reference.

+1 That is much better.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Comment Edited] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-25 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837714#comment-15837714
 ] 

Martijn van Groningen edited comment on LUCENE-6959 at 1/25/17 1:19 PM:


I've modified Adrien's patch and ported the ES query mentioned in my previous 
comment to Lucene join module.

bq. I'd like to understand a bit better how exactly we can re-implement this 
functionality once we remove the collector. That ES query class seems to be 
created for each parent doc that made the top N hits, right?

Yes, that is what it is doing.


was (Author: martijn.v.groningen):
I've modified Adrien's patch and ported the ES query mentioned in my previous 
comment to Lucene join module.

> I'd like to understand a bit better how exactly we can re-implement this 
> functionality once we remove the collector. That ES query class seems to be 
> created for each parent doc that made the top N hits, right?

Yes, that is what it is doing.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Updated] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-25 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-6959:
--
Attachment: LUCENE_6959.patch

I've modified Adrien's patch and ported the ES query mentioned in my previous 
comment to Lucene join module.

> I'd like to understand a bit better how exactly we can re-implement this 
> functionality once we remove the collector. That ES query class seems to be 
> created for each parent doc that made the top N hits, right?

Yes, that is what it is doing.

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE_6959.patch, LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Commented] (LUCENE-6959) Remove ToParentBlockJoinCollector

2017-01-24 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836779#comment-15836779
 ] 

Martijn van Groningen commented on LUCENE-6959:
---

+1 To remove this collector in the master and 6x branches.

As a follow-up issue we can add back the ability to include child docs, but in a 
different way than is done today: a subsequent search, after the main search, 
that selects child docs for specific parents. For example, like is done here: 
https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/fetch/subphase/InnerHitsContext.java#L160
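
A minimal sketch of the idea behind such a follow-up search, assuming the usual 
block-join layout in which the children of a parent are indexed immediately 
before it, and assuming a bit set that marks all parent docs. `BlockChildren` 
and `childRange` are hypothetical names, and `java.util.BitSet` stands in for 
whatever bit set implementation the real code uses.

```java
import java.util.BitSet;

/** Hypothetical helper: finds the doc ID range of the children of one parent. */
final class BlockChildren {

  /**
   * @param parents   bit set with one bit set per parent doc
   * @param parentDoc doc ID of a parent that made the top hits
   * @return {firstChild, parentDoc}: children are the docs in [firstChild, parentDoc)
   */
  static int[] childRange(BitSet parents, int parentDoc) {
    if (!parents.get(parentDoc)) {
      throw new IllegalArgumentException("doc " + parentDoc + " is not a parent");
    }
    // The previous parent closes the previous block; returns -1 if there is none.
    int previousParent = parents.previousSetBit(parentDoc - 1);
    return new int[] {previousParent + 1, parentDoc};
  }
}
```

A child query restricted to that range then yields the child docs for that 
one parent, which can be repeated for each parent in the top N hits.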

> Remove ToParentBlockJoinCollector
> -
>
> Key: LUCENE-6959
> URL: https://issues.apache.org/jira/browse/LUCENE-6959
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-6959.patch
>
>
> This collector uses the getWeight() and getChildren() methods from the passed 
> in Scorer, which are not always available (eg. disjunctions expose fake 
> scorers) hence the need for a dedicated IndexSearcher 
> (ToParentBlockJoinIndexSearcher). Given that this is the only collector in 
> this case, I would like to remove it.




[jira] [Commented] (LUCENE-7617) Improve GroupingSearch API and extensibility

2017-01-06 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805662#comment-15805662
 ] 

Martijn van Groningen commented on LUCENE-7617:
---

+1 Thanks for cleaning this up!

I found a few places still using GROUP_VALUE_TYPE, in 
SecondPassGroupingCollector.SearchGroupDocs, GroupDocs, TopGroups, 
AllGroupHeadsCollector.GroupHead and Grouping.Command (in Solr).

bq. Given that everything here is marked as experimental, I think we're OK to 
just backwards-break?

Yes, that is OK. 



> Improve GroupingSearch API and extensibility
> 
>
> Key: LUCENE-7617
> URL: https://issues.apache.org/jira/browse/LUCENE-7617
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Minor
> Attachments: LUCENE-7617.patch, LUCENE-7617.patch
>
>
> While looking at how to make grouping work with the new XValuesSource API in 
> core, I thought I'd try and clean up GroupingSearch a bit.  We have three 
> different ways of grouping at the moment: by doc block, using a single-pass 
> collector; by field; and by ValueSource.  The latter two both use essentially 
> the same two-pass mechanism, with different Collector implementations.
> I can see a number of possible improvements here:
> * abstract the two-pass collector creation into a factory API, which should 
> allow us to add the XValuesSource implementations as well
> * clean up the generics on the two-pass collectors - maybe look into removing 
> them entirely?  I'm not sure they add anything really, and we don't have them 
> on the equivalent plan search APIs
> * think about moving the document block method into the join module instead, 
> alongside all the other block-indexing code
> * rename the various Collector base classes so that they don't have 
> 'Abstract' in them anymore




[jira] [Commented] (LUCENE-7617) Improve GroupingSearch API and extensibility

2017-01-05 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15800777#comment-15800777
 ] 

Martijn van Groningen commented on LUCENE-7617:
---

+1 to this change. This should make using these collectors easier.

There are a couple of places where I saw if statements without curly brackets. 
Maybe add these curly brackets. I find it easier to read.

bq. clean up the generics on the two-pass collectors - maybe look into removing 
them entirely?

As far as I can see, the base classes use these generics so that subclasses 
don't have to do manual casts. Which parts would you like to clean up?

bq. think about moving the document block method into the join module instead, 
alongside all the other block-indexing code

I would prefer if the `BlockGroupingCollector` stayed in the grouping module. 
Block indexing is a feature provided by core, and the way I see it modules 
can have features that use it. Also, the join and grouping modules each provide 
different functionality, although from a higher level the functionality 
overlaps a bit, in the sense that some use cases could be implemented with 
either the join or the grouping module.

bq. rename the various Collector base classes so that they don't have 
'Abstract' in them anymore

Agreed, there is a lot of 'Abstract' in the names :)

> Improve GroupingSearch API and extensibility
> 
>
> Key: LUCENE-7617
> URL: https://issues.apache.org/jira/browse/LUCENE-7617
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Minor
> Attachments: LUCENE-7617.patch
>
>
> While looking at how to make grouping work with the new XValuesSource API in 
> core, I thought I'd try and clean up GroupingSearch a bit.  We have three 
> different ways of grouping at the moment: by doc block, using a single-pass 
> collector; by field; and by ValueSource.  The latter two both use essentially 
> the same two-pass mechanism, with different Collector implementations.
> I can see a number of possible improvements here:
> * abstract the two-pass collector creation into a factory API, which should 
> allow us to add the XValuesSource implementations as well
> * clean up the generics on the two-pass collectors - maybe look into removing 
> them entirely?  I'm not sure they add anything really, and we don't have them 
> on the equivalent plan search APIs
> * think about moving the document block method into the join module instead, 
> alongside all the other block-indexing code
> * rename the various Collector base classes so that they don't have 
> 'Abstract' in them anymore




[jira] [Commented] (LUCENE-7418) remove legacy numerics from join/ and queryparser/

2016-08-19 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427829#comment-15427829
 ] 

Martijn van Groningen commented on LUCENE-7418:
---

+1 to the changes in join

> remove legacy numerics from join/ and queryparser/
> --
>
> Key: LUCENE-7418
> URL: https://issues.apache.org/jira/browse/LUCENE-7418
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
> Attachments: LUCENE-7418.patch
>
>
> We have three modules with (temporary) dependency on backwards codecs:
> * join/
> * queryparser/
> * spatial-extras/
> this patch handles the first two, as they are easy. spatial-extras is more 
> complex as its legacy support is not clearly separated, so i'm not trying to 
> address that here.
> For join/ we just remove deprecations. For queryparser/, same thing, except 
> since solr exposes the xml queryparser, i moved the LegacyRangeQueryBuilder 
> to solr and hooked it into its subclass of the parser.




[jira] [Commented] (LUCENE-7394) Make MemoryIndex immutable

2016-07-25 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392121#comment-15392121
 ] 

Martijn van Groningen commented on LUCENE-7394:
---

bq. We'd still need a way to set a Similarity so that we can encode norms, I 
think?

Yes, but that should be set as a constructor parameter, before any fields are added.

bq. The tricky part here is going to be untangling the various shared block 
pools. We need to make sure that calling .addField() doesn't change the data 
referenced by a previously created IndexReader, which is where I got stuck last 
time I tried playing around with this idea.

I think we should avoid sharing block pools between IndexReader instances; 
sharing makes it hard (impossible?) to make MemoryIndex immutable and to clean 
up this class. To be clear: from a usage / API perspective, MemoryIndex should 
be renamed to MemoryIndexBuilder, which has a constructor that accepts a 
Similarity and two methods, addField(...) and build(). After build() has been 
invoked, calling addField(...) will fail.

Later on we can investigate some kind of reuse by adding an extra constructor 
to MemoryIndexBuilder that accepts an IndexReader. This would make a copy of the 
previously created MemoryIndex and, where possible, shallow copies / clones of 
the previously created data structures.
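
To make the proposed contract concrete, here is a rough sketch of a one-shot 
builder. It is not Lucene code: `OneShotIndexBuilder` and `BuiltIndex` are 
hypothetical names, a plain `String` stands in for the Similarity, field values 
are plain strings, and `BuiltIndex` stands in for the IndexReader that build() 
would return. Rejecting a second build() call is an extra assumption of the 
sketch.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical immutable result; stands in for the IndexReader returned by build(). */
final class BuiltIndex {
  private final String similarity;
  private final Map<String, String> fields;

  BuiltIndex(String similarity, Map<String, String> fields) {
    this.similarity = similarity;
    // Defensive copy, so the result shares no mutable state with the builder.
    this.fields = Collections.unmodifiableMap(new LinkedHashMap<>(fields));
  }

  String similarity() { return similarity; }

  Map<String, String> fields() { return fields; }
}

/** Hypothetical builder: the similarity is fixed up front, fields are added, then build(). */
final class OneShotIndexBuilder {
  private final String similarity;
  private final Map<String, String> fields = new LinkedHashMap<>();
  private boolean built;

  OneShotIndexBuilder(String similarity) {
    this.similarity = similarity;
  }

  OneShotIndexBuilder addField(String name, String value) {
    if (built) {
      throw new IllegalStateException("addField() called after build()");
    }
    fields.put(name, value);
    return this;
  }

  BuiltIndex build() {
    if (built) {
      throw new IllegalStateException("build() called twice");
    }
    built = true;
    return new BuiltIndex(similarity, fields);
  }
}
```

Because the result copies what it needs, nothing the builder does afterwards 
can change what a previously returned instance sees, which is the property 
that shared block pools make hard to guarantee.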

> Make MemoryIndex immutable
> --
>
> Key: LUCENE-7394
> URL: https://issues.apache.org/jira/browse/LUCENE-7394
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>
> The MemoryIndex itself should just be a builder that constructs an 
> IndexReader instance. The whole notion of freezing a memory index should be 
> removed.
> While we change this we should also clean this class up. There are many 
> methods to add a field, we should just have a single method that accepts a 
> `IndexableField`.
> The `keywordTokenStream(...)` method is unused and untested and should be 
> removed and it doesn't belong with the memory index.
> The `setSimilarity(...)`, `createSearcher(...)` and `search(...)` methods 
> should be removed, because the MemoryIndex should just be responsible for 
> creating an IndexReader instance.




[jira] [Commented] (LUCENE-7391) MemoryIndexReader.fields() performance regression

2016-07-25 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391441#comment-15391441
 ] 

Martijn van Groningen commented on LUCENE-7391:
---

bq. +1 - freeze() was a hack, and I've been meaning to open an issue to make 
things properly immutable for ages.

[~romseygeek] Then let's try to fix this in master :) I've opened LUCENE-7394 to 
track this.


> MemoryIndexReader.fields() performance regression
> -
>
> Key: LUCENE-7391
> URL: https://issues.apache.org/jira/browse/LUCENE-7391
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Mason
>Assignee: David Smiley
> Attachments: LUCENE-7391-test.patch, LUCENE-7391.patch, 
> LUCENE-7391.patch
>
>
> While upgrading our codebase from Lucene 4 to Lucene 6 we found a significant 
> performance regression - a 5x slowdown
> On profiling the code, the method MemoryIndexReader.fields() shows up as one 
> of the hottest methods
> Looking at the method, it just creates a copy of the inner {{fields}} Map 
> before passing it to {{MemoryFields}}. It does this so that it can filter out 
> fields with {{numTokens <= 0}}.
> The simplest "fix" would be to just remove the copying of the map completely, 
> and pass {{fields}} directly to {{MemoryFields}}.  It's simple and removes 
> any slowdown caused by this method.  It does potentially change behaviour 
> though, but none of the unit tests seem to test that behaviour so I wonder 
> whether it's necessary (I looked at the original ticket LUCENE-7091 that 
> introduced this code, I can't find much in way of an explanation). I'm going 
> to attach a patch to this effect anyway and we can take things from there




[jira] [Created] (LUCENE-7394) Make MemoryIndex immutable

2016-07-25 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7394:
-

 Summary: Make MemoryIndex immutable
 Key: LUCENE-7394
 URL: https://issues.apache.org/jira/browse/LUCENE-7394
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Martijn van Groningen


The MemoryIndex itself should just be a builder that constructs an IndexReader 
instance. The whole notion of freezing a memory index should be removed.

While we change this we should also clean this class up. There are many methods 
to add a field; we should just have a single method that accepts an 
`IndexableField`.

The `keywordTokenStream(...)` method is unused and untested, and it doesn't 
belong with the memory index, so it should be removed.

The `setSimilarity(...)`, `createSearcher(...)` and `search(...)` methods 
should be removed, because the MemoryIndex should just be responsible for 
creating an IndexReader instance.




[jira] [Commented] (LUCENE-7391) MemoryIndexReader.fields() performance regression

2016-07-25 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391427#comment-15391427
 ] 

Martijn van Groningen commented on LUCENE-7391:
---

+1 The filtering cost is deferred to when someone really needs to know the size 
or the actual fields, and I think that this is better than what happens now.
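
As a rough illustration of what deferring the filtering means (not the actual 
patch), the sketch below wraps the field map without copying it and only 
filters when the size or the field names are requested. `LazyFieldsView` is a 
hypothetical name, and the map from field name to token count stands in for 
MemoryIndex's internal per-field info.

```java
import java.util.Iterator;
import java.util.Map;

/** Hypothetical view over a field map that hides fields without tokens. */
final class LazyFieldsView implements Iterable<String> {
  // Shared with the owner, not copied: creating the view is O(1).
  private final Map<String, Integer> numTokensByField;

  LazyFieldsView(Map<String, Integer> numTokensByField) {
    this.numTokensByField = numTokensByField;
  }

  /** Filters while iterating, so the cost is only paid by callers that iterate. */
  @Override
  public Iterator<String> iterator() {
    return numTokensByField.entrySet().stream()
        .filter(entry -> entry.getValue() > 0)
        .map(Map.Entry::getKey)
        .iterator();
  }

  /** Counts on demand instead of caching, so there is nothing to invalidate. */
  int size() {
    int count = 0;
    for (int numTokens : numTokensByField.values()) {
      if (numTokens > 0) {
        count++;
      }
    }
    return count;
  }
}
```

The trade-off is that size() is linear in the number of fields on every call, 
which should be cheap for the small field counts a memory index typically has.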

Small nit: Maybe rename the variable `ignored` to `field` in the `size()` 
method as it is actually not ignored?

I'll let David commit this as he assigned himself to this issue.

> MemoryIndexReader.fields() performance regression
> -
>
> Key: LUCENE-7391
> URL: https://issues.apache.org/jira/browse/LUCENE-7391
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Mason
>Assignee: David Smiley
> Attachments: LUCENE-7391-test.patch, LUCENE-7391.patch, 
> LUCENE-7391.patch
>
>
> While upgrading our codebase from Lucene 4 to Lucene 6 we found a significant 
> performance regression - a 5x slowdown
> On profiling the code, the method MemoryIndexReader.fields() shows up as one 
> of the hottest methods
> Looking at the method, it just creates a copy of the inner {{fields}} Map 
> before passing it to {{MemoryFields}}. It does this so that it can filter out 
> fields with {{numTokens <= 0}}.
> The simplest "fix" would be to just remove the copying of the map completely, 
> and pass {{fields}} directly to {{MemoryFields}}.  It's simple and removes 
> any slowdown caused by this method.  It does potentially change behaviour 
> though, but none of the unit tests seem to test that behaviour so I wonder 
> whether it's necessary (I looked at the original ticket LUCENE-7091 that 
> introduced this code, I can't find much in way of an explanation). I'm going 
> to attach a patch to this effect anyway and we can take things from there




[jira] [Resolved] (LUCENE-7389) Validation issue in FieldType#setDimensions?

2016-07-25 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7389.
---
   Resolution: Fixed
Fix Version/s: 6.2
   master (7.0)

Thanks Adrien and Mike!

(I accidentally used the wrong issue number in the commit message.)
Fixed in master: 
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9b85f68
and branch_6x: 
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=dc54f97

> Validation issue in FieldType#setDimensions?
> 
>
> Key: LUCENE-7389
> URL: https://issues.apache.org/jira/browse/LUCENE-7389
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-7383.patch
>
>
> It compares if the {{dimensionCount}} is larger than 
> {{PointValues.MAX_NUM_BYTES}} while this constant should be compared to 
> {{dimensionNumBytes}} instead?
> So this if statement:
> {noformat}
> if (dimensionCount > PointValues.MAX_NUM_BYTES) {
>   throw new IllegalArgumentException("dimensionNumBytes must be <= " + 
> PointValues.MAX_NUM_BYTES + "; got " + dimensionNumBytes);
> }
> {noformat}
> Should be:
> {noformat}
> if (dimensionNumBytes > PointValues.MAX_NUM_BYTES) {
>   throw new IllegalArgumentException("dimensionNumBytes must be <= " + 
> PointValues.MAX_NUM_BYTES + "; got " + dimensionNumBytes);
> }
> {noformat}




[jira] [Commented] (LUCENE-7391) MemoryIndexReader.fields() performance regression

2016-07-22 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389646#comment-15389646
 ] 

Martijn van Groningen commented on LUCENE-7391:
---

>  is it part of the contract that fields() should only return indexed fields 
> then?

Yes.

I think David's fix is the easiest here. Computing this count each time fields() 
is invoked is less overhead than what happens now when building 
{{MemoryFields}}. Since that count is computed each time, I think you shouldn't 
need to worry about caching or cache invalidation.

The concurrency aspect of the MemoryIndex is in my opinion a bit of a mess. It 
allows fields to be added after a reader has been created, except when the 
freeze method has been invoked (after which it should be usable from many 
threads). I think the MemoryIndex class itself should be a kind of builder 
that just returns an IndexReader and shouldn't be usable after an 
IndexReader instance has been made.

 

> MemoryIndexReader.fields() performance regression
> -
>
> Key: LUCENE-7391
> URL: https://issues.apache.org/jira/browse/LUCENE-7391
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Mason
> Attachments: LUCENE-7391.patch
>
>
> While upgrading our codebase from Lucene 4 to Lucene 6 we found a significant 
> performance regression - a 5x slowdown
> On profiling the code, the method MemoryIndexReader.fields() shows up as one 
> of the hottest methods
> Looking at the method, it just creates a copy of the inner {{fields}} Map 
> before passing it to {{MemoryFields}}. It does this so that it can filter out 
> fields with {{numTokens <= 0}}.
> The simplest "fix" would be to just remove the copying of the map completely, 
> and pass {{fields}} directly to {{MemoryFields}}.  It's simple and removes 
> any slowdown caused by this method.  It does potentially change behaviour 
> though, but none of the unit tests seem to test that behaviour so I wonder 
> whether it's necessary (I looked at the original ticket LUCENE-7091 that 
> introduced this code, I can't find much in way of an explanation). I'm going 
> to attach a patch to this effect anyway and we can take things from there




[jira] [Commented] (LUCENE-7391) MemoryIndexReader.fields() performance regression

2016-07-22 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389623#comment-15389623
 ] 

Martijn van Groningen commented on LUCENE-7391:
---

+1 to count the number of fields with `numTokens > 0` and filter out fields with 
`numTokens <= 0`

> MemoryIndexReader.fields() performance regression
> -
>
> Key: LUCENE-7391
> URL: https://issues.apache.org/jira/browse/LUCENE-7391
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Mason
> Attachments: LUCENE-7391.patch
>
>
> While upgrading our codebase from Lucene 4 to Lucene 6 we found a significant 
> performance regression - a 5x slowdown
> On profiling the code, the method MemoryIndexReader.fields() shows up as one 
> of the hottest methods
> Looking at the method, it just creates a copy of the inner {{fields}} Map 
> before passing it to {{MemoryFields}}. It does this so that it can filter out 
> fields with {{numTokens <= 0}}.
> The simplest "fix" would be to just remove the copying of the map completely, 
> and pass {{fields}} directly to {{MemoryFields}}.  It's simple and removes 
> any slowdown caused by this method.  It does potentially change behaviour 
> though, but none of the unit tests seem to test that behaviour so I wonder 
> whether it's necessary (I looked at the original ticket LUCENE-7091 that 
> introduced this code, I can't find much in way of an explanation). I'm going 
> to attach a patch to this effect anyway and we can take things from there




[jira] [Updated] (LUCENE-7389) Validation issue in FieldType#setDimensions?

2016-07-22 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7389:
--
Attachment: LUCENE-7383.patch

Attached fix.

Luckily this validation was also checked (correctly) in FieldInfo.java line 178, 
so there shouldn't be indices with too large dimensions.

> Validation issue in FieldType#setDimensions?
> 
>
> Key: LUCENE-7389
> URL: https://issues.apache.org/jira/browse/LUCENE-7389
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Martijn van Groningen
> Attachments: LUCENE-7383.patch
>
>
> It compares if the {{dimensionCount}} is larger than 
> {{PointValues.MAX_NUM_BYTES}} while this constant should be compared to 
> {{dimensionNumBytes}} instead?
> So this if statement:
> {noformat}
> if (dimensionCount > PointValues.MAX_NUM_BYTES) {
>   throw new IllegalArgumentException("dimensionNumBytes must be <= " + 
> PointValues.MAX_NUM_BYTES + "; got " + dimensionNumBytes);
> }
> {noformat}
> Should be:
> {noformat}
> if (dimensionNumBytes > PointValues.MAX_NUM_BYTES) {
>   throw new IllegalArgumentException("dimensionNumBytes must be <= " + 
> PointValues.MAX_NUM_BYTES + "; got " + dimensionNumBytes);
> }
> {noformat}




[jira] [Commented] (LUCENE-7391) MemoryIndexReader.fields() performance regression

2016-07-22 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15389577#comment-15389577
 ] 

Martijn van Groningen commented on LUCENE-7391:
---

The reason it filters out fields with {{numTokens <= 0}} is that it would 
otherwise include non-indexed fields (fields with just doc values or point 
values). However, this slowdown is unintended. Maybe instead we could build 
`filteredFields` in the constructor of `MemoryIndexReader` and reuse it between 
`#fields()` invocations?
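
A rough sketch of that alternative, assuming a hypothetical reader class and a simplified map from field name to token count (the real {{MemoryIndexReader}} works with richer per-field info):

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical reader that filters once, in its constructor.
final class CachedFieldsReaderSketch {
    private final Map<String, Integer> filteredFields;

    // fieldTokenCounts: field name -> number of indexed tokens (assumed shape).
    CachedFieldsReaderSketch(Map<String, Integer> fieldTokenCounts) {
        Map<String, Integer> filtered = new TreeMap<>();
        for (Map.Entry<String, Integer> e : fieldTokenCounts.entrySet()) {
            if (e.getValue() > 0) { // skip doc-values-only or points-only fields
                filtered.put(e.getKey(), e.getValue());
            }
        }
        this.filteredFields = Collections.unmodifiableMap(filtered);
    }

    // Every call returns the same pre-filtered map; nothing is copied here.
    Map<String, Integer> fields() {
        return filteredFields;
    }
}
```

The trade-off in this sketch is that fields added after the reader has been constructed are not visible through {{fields()}}.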

> MemoryIndexReader.fields() performance regression
> -
>
> Key: LUCENE-7391
> URL: https://issues.apache.org/jira/browse/LUCENE-7391
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Mason
> Attachments: LUCENE-7391.patch
>
>
> While upgrading our codebase from Lucene 4 to Lucene 6 we found a significant 
> performance regression - a 5x slowdown
> On profiling the code, the method MemoryIndexReader.fields() shows up as one 
> of the hottest methods
> Looking at the method, it just creates a copy of the inner {{fields}} Map 
> before passing it to {{MemoryFields}}. It does this so that it can filter out 
> fields with {{numTokens <= 0}}.
> The simplest "fix" would be to just remove the copying of the map completely, 
> and pass {{fields}} directly to {{MemoryFields}}.  It's simple and removes 
> any slowdown caused by this method.  It does potentially change behaviour 
> though, but none of the unit tests seem to test that behaviour so I wonder 
> whether it's necessary (I looked at the original ticket LUCENE-7091 that 
> introduced this code, I can't find much in way of an explanation). I'm going 
> to attach a patch to this effect anyway and we can take things from there




[jira] [Created] (LUCENE-7389) Validation issue in FieldType#setDimensions?

2016-07-22 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7389:
-

 Summary: Validation issue in FieldType#setDimensions?
 Key: LUCENE-7389
 URL: https://issues.apache.org/jira/browse/LUCENE-7389
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Martijn van Groningen


It compares if the {{dimensionCount}} is larger than 
{{PointValues.MAX_NUM_BYTES}} while this constant should be compared to 
{{dimensionNumBytes}} instead?

So this if statement:

{noformat}
if (dimensionCount > PointValues.MAX_NUM_BYTES) {
  throw new IllegalArgumentException("dimensionNumBytes must be <= " + 
PointValues.MAX_NUM_BYTES + "; got " + dimensionNumBytes);
}
{noformat}

Should be:

{noformat}
if (dimensionNumBytes > PointValues.MAX_NUM_BYTES) {
  throw new IllegalArgumentException("dimensionNumBytes must be <= " + 
PointValues.MAX_NUM_BYTES + "; got " + dimensionNumBytes);
}
{noformat}
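
To illustrate why the wrong comparison matters, here is a self-contained sketch of the intended validation. The class name and the two limit values are assumptions for illustration only; the real limits are defined in {{PointValues}}.

```java
final class DimensionValidationSketch {
    // Placeholder limits, assumed for this sketch.
    static final int MAX_DIMENSIONS = 8;
    static final int MAX_NUM_BYTES = 16;

    // Each argument is compared against its own limit.
    static void checkDimensions(int dimensionCount, int dimensionNumBytes) {
        if (dimensionCount > MAX_DIMENSIONS) {
            throw new IllegalArgumentException(
                "dimensionCount must be <= " + MAX_DIMENSIONS + "; got " + dimensionCount);
        }
        if (dimensionNumBytes > MAX_NUM_BYTES) {
            throw new IllegalArgumentException(
                "dimensionNumBytes must be <= " + MAX_NUM_BYTES + "; got " + dimensionNumBytes);
        }
    }
}
```

With the buggy comparison, a call such as {{checkDimensions(1, 64)}} would pass, because {{dimensionCount}} (1) is not larger than the byte limit even though {{dimensionNumBytes}} (64) is.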




[jira] [Resolved] (LUCENE-7383) FieldQueryTest.testFlattenToParentBlockJoinQuery failure

2016-07-15 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7383.
---
Resolution: Fixed

Thanks for raising the issue [~mikemccand]!

> FieldQueryTest.testFlattenToParentBlockJoinQuery failure
> 
>
> Key: LUCENE-7383
> URL: https://issues.apache.org/jira/browse/LUCENE-7383
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Martijn van Groningen
>
> Reproduces for me in master:
> {noformat}
>[junit4] Started J0 PID(26725@localhost).
>[junit4] Suite: org.apache.lucene.search.vectorhighlight.FieldQueryTest
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=FieldQueryTest 
> -Dtests.method=testFlattenToParentBlockJoinQuery 
> -Dtests.seed=FBAF10B3AA838B8D -Dtests.slow=true -Dtests.locale=pt 
> -Dtests.timezone=Asia/Chita -Dtests.asserts=true -Dtests.file.encoding=UTF-8
>[junit4] FAILURE 0.10s | FieldQueryTest.testFlattenToParentBlockJoinQuery 
> <<<
>[junit4]> Throwable #1: java.lang.AssertionError
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([FBAF10B3AA838B8D:6C7C115D5027C6BB]:0)
>[junit4]>  at 
> org.apache.lucene.search.vectorhighlight.AbstractTestCase.assertCollectionQueries(AbstractTestCase.java:162)
>[junit4]>  at 
> org.apache.lucene.search.vectorhighlight.FieldQueryTest.testFlattenToParentBlockJoinQuery(FieldQueryTest.java:966)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {}, 
> docValues:{}, maxPointsInLeafNode=1120, maxMBSortInHeap=7.244053319393249, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=pt, timezone=Asia/Chita
>[junit4]   2> NOTE: Linux 4.2.0-38-generic amd64/Oracle Corporation 
> 1.8.0_92 (64-bit)/cpus=8,threads=1,free=430920456,total=504889344
>[junit4]   2> NOTE: All tests run in this JVM: [FieldQueryTest]
>[junit4] Completed [1/1 (1!)] in 0.47s, 1 test, 1 failure <<< FAILURES!
>[junit4] 
>[junit4] 
>[junit4] Tests with failures [seed: FBAF10B3AA838B8D]:
>[junit4]   - 
> org.apache.lucene.search.vectorhighlight.FieldQueryTest.testFlattenToParentBlockJoinQuery
> {noformat}




[jira] [Assigned] (LUCENE-7383) FieldQueryTest.testFlattenToParentBlockJoinQuery failure

2016-07-15 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen reassigned LUCENE-7383:
-

Assignee: Martijn van Groningen

> FieldQueryTest.testFlattenToParentBlockJoinQuery failure
> 
>
> Key: LUCENE-7383
> URL: https://issues.apache.org/jira/browse/LUCENE-7383
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Michael McCandless
>Assignee: Martijn van Groningen
>
> Reproduces for me in master:
> {noformat}
>[junit4] Started J0 PID(26725@localhost).
>[junit4] Suite: org.apache.lucene.search.vectorhighlight.FieldQueryTest
>[junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=FieldQueryTest 
> -Dtests.method=testFlattenToParentBlockJoinQuery 
> -Dtests.seed=FBAF10B3AA838B8D -Dtests.slow=true -Dtests.locale=pt 
> -Dtests.timezone=Asia/Chita -Dtests.asserts=true -Dtests.file.encoding=UTF-8
>[junit4] FAILURE 0.10s | FieldQueryTest.testFlattenToParentBlockJoinQuery 
> <<<
>[junit4]> Throwable #1: java.lang.AssertionError
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([FBAF10B3AA838B8D:6C7C115D5027C6BB]:0)
>[junit4]>  at 
> org.apache.lucene.search.vectorhighlight.AbstractTestCase.assertCollectionQueries(AbstractTestCase.java:162)
>[junit4]>  at 
> org.apache.lucene.search.vectorhighlight.FieldQueryTest.testFlattenToParentBlockJoinQuery(FieldQueryTest.java:966)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {}, 
> docValues:{}, maxPointsInLeafNode=1120, maxMBSortInHeap=7.244053319393249, 
> sim=RandomSimilarity(queryNorm=false): {}, locale=pt, timezone=Asia/Chita
>[junit4]   2> NOTE: Linux 4.2.0-38-generic amd64/Oracle Corporation 
> 1.8.0_92 (64-bit)/cpus=8,threads=1,free=430920456,total=504889344
>[junit4]   2> NOTE: All tests run in this JVM: [FieldQueryTest]
>[junit4] Completed [1/1 (1!)] in 0.47s, 1 test, 1 failure <<< FAILURES!
>[junit4] 
>[junit4] 
>[junit4] Tests with failures [seed: FBAF10B3AA838B8D]:
>[junit4]   - 
> org.apache.lucene.search.vectorhighlight.FieldQueryTest.testFlattenToParentBlockJoinQuery
> {noformat}




[jira] [Resolved] (LUCENE-7376) Add ToParentBlockJoinQuery support to FVH's FieldQuery

2016-07-14 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7376.
---
   Resolution: Fixed
Fix Version/s: 6.2
   master (7.0)

> Add ToParentBlockJoinQuery support to FVH's FieldQuery
> --
>
> Key: LUCENE-7376
> URL: https://issues.apache.org/jira/browse/LUCENE-7376
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: Martijn van Groningen
>Priority: Minor
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE_7376.patch
>
>





[jira] [Updated] (LUCENE-7376) Add ToParentBlockJoinQuery support to FVH's FieldQuery

2016-07-13 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7376:
--
Attachment: LUCENE_7376.patch

> Add ToParentBlockJoinQuery support to FVH's FieldQuery
> --
>
> Key: LUCENE-7376
> URL: https://issues.apache.org/jira/browse/LUCENE-7376
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE_7376.patch
>
>





[jira] [Created] (LUCENE-7376) Add ToParentBlockJoinQuery support to FVH's FieldQuery

2016-07-13 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7376:
-

 Summary: Add ToParentBlockJoinQuery support to FVH's FieldQuery
 Key: LUCENE-7376
 URL: https://issues.apache.org/jira/browse/LUCENE-7376
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Reporter: Martijn van Groningen
Priority: Minor







[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2016-06-07 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE_7304.patch

Changed the block join query to only require that parent docs store how far 
away their first child doc is (in docids).

This reduces the amount of information that has to be stored in the doc values 
offset field, and these parent offsets compress better than the offset 
values used before (which were composed of more information).

I tested this patch out on a test data set 
(https://archive.org/download/stackexchange/english.stackexchange.com.7z). I 
extracted the questions, answers and comments and indexed each question with its 
answers and related comments as a hierarchical block of documents. In total 
745252 docs were indexed. The size of the doc values offset field was 839592 
bytes. 

After that I ran a query that selects all questions that have answers with 
comments (questions -> answers -> comments) for both the current block join and 
the doc values block join. The current block join used 186768 bytes of JVM heap 
for bitsets and the doc values block join used 1132 bytes of JVM heap for 
references to the offset doc values field. 

So the doc values approach used roughly 4.5 times more RAM in total 
(assuming the OS caches the offset field), while its JVM heap footprint was 
roughly 165 times smaller. 
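
For reference, the two ratios can be reproduced from the byte counts reported above. This sketch assumes that "in total" means the offset field size plus the heap references; the class and method names are made up.

```java
final class BlockJoinMemoryRatios {
    // Figures taken from the measurements in this comment.
    static final long OFFSET_FIELD_BYTES = 839_592L;   // doc values offset field
    static final long BITSET_HEAP_BYTES = 186_768L;    // current block join, heap
    static final long DOC_VALUES_HEAP_BYTES = 1_132L;  // doc values block join, heap

    // Total RAM of the doc values approach relative to the bitset approach.
    static double totalRamRatio() {
        return (double) (OFFSET_FIELD_BYTES + DOC_VALUES_HEAP_BYTES) / BITSET_HEAP_BYTES;
    }

    // How many times smaller the heap footprint of the doc values approach is.
    static double heapRatio() {
        return (double) BITSET_HEAP_BYTES / DOC_VALUES_HEAP_BYTES;
    }
}
```

Both values round to the numbers quoted above: about 4.5 for total RAM and about 165 for JVM heap.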

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-06-07 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318393#comment-15318393
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

bq. The last time I tried doc values, I could not use advance(target) on them. 
Is that still the case?

That is still the case. But the doc values block join works by storing 
offsets (how far away the first child doc is in docids and how far away the 
closest parent is), and at query time these are used to advance the child 
scorer. However, when doc values become iterator based, these offsets can be 
encoded much more efficiently than is now the case.
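
A simplified sketch of how such offsets can be used to move between a parent and its children. It assumes the usual block layout in which children are indexed directly before their parent, and it uses plain arrays (hypothetical names) in place of the numeric doc values field.

```java
final class OffsetBlockJoinSketch {
    private final long[] toParent;     // per doc: docids to its parent (0 for a parent doc)
    private final long[] toFirstChild; // per doc: docids back to its first child (0 for a child doc)

    OffsetBlockJoinSketch(long[] toParent, long[] toFirstChild) {
        if (toParent.length != toFirstChild.length) {
            throw new IllegalArgumentException("offset arrays must have the same length");
        }
        this.toParent = toParent.clone();
        this.toFirstChild = toFirstChild.clone();
    }

    // Children precede their parent, so the parent lies at a higher docid.
    int parentOf(int childDoc) {
        return childDoc + Math.toIntExact(toParent[childDoc]);
    }

    // The first child lies at a lower docid than the parent.
    int firstChildOf(int parentDoc) {
        return parentDoc - Math.toIntExact(toFirstChild[parentDoc]);
    }
}
```

For a segment with docs 0..2 as children of doc 3, and doc 4 as the child of doc 5, the arrays would be {{toParent = [3, 2, 1, 0, 1, 0]}} and {{toFirstChild = [0, 0, 0, 3, 0, 1]}}.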

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-06-06 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317229#comment-15317229
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

[~paul.elsc...@xs4all.nl] This is a lot of code :) I really think this should 
be moved to a new issue, not just because of the size of the patch, but also 
because the implementation is different from what was initially proposed 
here. Also I think that EliasFanoDocIdSet and friends shouldn't be added to 
core, but should be added to the join module instead. EliasFano was superseded 
in core as a general-purpose DocIdSet by other implementations a while ago, and 
since it will now be used in the context of block join, it makes sense to just 
add it to the join module. 

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Resolved] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-06-03 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7307.
---
Resolution: Fixed

[~mikemccand] yes! Thanks for reminding me.

> Add getters to PointInSetQuery and PointRangeQuery classes
> --
>
> Key: LUCENE-7307
> URL: https://issues.apache.org/jira/browse/LUCENE-7307
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE-7307, LUCENE_7307.patch, LUCENE_7307.patch, 
> LUCENE_7307.patch, LUCENE_7307.patch
>
>





[jira] [Commented] (LUCENE-7276) Add an optional reason to the MatchNoDocsQuery

2016-06-02 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15312081#comment-15312081
 ] 

Martijn van Groningen commented on LUCENE-7276:
---

+1

> Add an optional reason to the MatchNoDocsQuery
> --
>
> Key: LUCENE-7276
> URL: https://issues.apache.org/jira/browse/LUCENE-7276
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Ferenczi Jim
>Priority: Minor
>  Labels: patch
> Attachments: LUCENE-7276.patch
>
>
> It's sometimes difficult to debug a query that results in a MatchNoDocsQuery. 
> The MatchNoDocsQuery is always rewritten in an empty boolean query.
> This patch adds an optional reason and implements a weight in order to keep 
> track of the reason why the query did not match any document. The reason is 
> printed on toString and when an explanation for noMatch is asked.  
> For instance the query:
> new MatchNoDocsQuery("Field not found").toString()
> => 'MatchNoDocsQuery["field 'title' not found"]'




[jira] [Updated] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-06-02 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7307:
--
Attachment: LUCENE-7307

I've updated the patch.

> Add getters to PointInSetQuery and PointRangeQuery classes
> --
>
> Key: LUCENE-7307
> URL: https://issues.apache.org/jira/browse/LUCENE-7307
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE-7307, LUCENE_7307.patch, LUCENE_7307.patch, 
> LUCENE_7307.patch, LUCENE_7307.patch
>
>





[jira] [Commented] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-06-02 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311882#comment-15311882
 ] 

Martijn van Groningen commented on LUCENE-7307:
---

bq. I think the byte[] version was better since it is consistent with 
PointRangeQuery which exposes the low/high bounds as a byte[]?

Good point. I'll change that.

bq. Also I don't think it is true that the sortedPackedPoints iterator makes a 
copy: looking at PrefixCodedTerms, it seems to be reusing the same BytesRef 
object?

True... we need to make a copy. I guess what confused me is the copy that 
PrefixCodedTerms makes of the input, but since its iterator reuses the BytesRef 
it copies into, a copy is required if we return the points in a collection.

bq. Something else we should do is throwing a NoSuchElementException in the 
Iterator when upTo==size to comply with the Iterator API.

+1!
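
A sketch of an iterator that addresses both points: it hands out a fresh copy of each point rather than the reused buffer, and it throws once exhausted. The class and its flat {{byte[]}} storage are assumptions for illustration and do not mirror how {{PointInSetQuery}} stores its points.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class PackedPointsIteratorSketch implements Iterator<byte[]> {
    private final byte[] packed;   // all points concatenated, bytesPerPoint bytes each
    private final int bytesPerPoint;
    private final int size;        // number of points
    private final byte[] scratch;  // reused between calls, like a reused BytesRef
    private int upTo;

    PackedPointsIteratorSketch(byte[] packed, int bytesPerPoint) {
        if (bytesPerPoint <= 0 || packed.length % bytesPerPoint != 0) {
            throw new IllegalArgumentException("packed length must be a multiple of bytesPerPoint");
        }
        this.packed = packed.clone();
        this.bytesPerPoint = bytesPerPoint;
        this.size = packed.length / bytesPerPoint;
        this.scratch = new byte[bytesPerPoint];
    }

    @Override
    public boolean hasNext() {
        return upTo < size;
    }

    @Override
    public byte[] next() {
        if (upTo == size) {
            throw new NoSuchElementException(); // required by the Iterator contract
        }
        System.arraycopy(packed, upTo * bytesPerPoint, scratch, 0, bytesPerPoint);
        upTo++;
        // Return a copy so earlier results are not overwritten by later calls.
        return Arrays.copyOf(scratch, bytesPerPoint);
    }
}
```

Returning {{scratch}} directly would make every element collected by a caller alias the same array, which is the problem described above.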

> Add getters to PointInSetQuery and PointRangeQuery classes
> --
>
> Key: LUCENE-7307
> URL: https://issues.apache.org/jira/browse/LUCENE-7307
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE_7307.patch, LUCENE_7307.patch, LUCENE_7307.patch, 
> LUCENE_7307.patch
>
>





[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-06-02 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311868#comment-15311868
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

bq. There is a dilemma here: either introduce DocBlocksIterator, or not 
implement MutableBits.

The block join queries are not using any of the methods that modify the bitset, 
so I think it is fine to not implement clear() and set() methods. Also it will 
not be a general purpose bitset, but specialized for the block join.
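
As a sketch of what such a specialized, read-only structure could expose, the following wraps {{java.util.BitSet}} and offers lookups only. The class name and the {{firstChildOf}} helper are hypothetical; it assumes children are stored directly before their parent.

```java
import java.util.BitSet;

// Read-only view over the parent bits: no set() or clear() is offered.
final class ReadOnlyParentBitsSketch {
    private final BitSet parents;

    ReadOnlyParentBitsSketch(BitSet parents) {
        this.parents = (BitSet) parents.clone(); // isolate from later changes by the caller
    }

    boolean get(int doc) {
        return parents.get(doc);
    }

    // Closest parent at or before doc, or -1 if there is none.
    int prevSetBit(int doc) {
        return parents.previousSetBit(doc);
    }

    // Closest parent at or after doc, or -1 if there is none.
    int nextSetBit(int doc) {
        return parents.nextSetBit(doc);
    }

    // First child of a parent: the doc right after the previous parent.
    int firstChildOf(int parentDoc) {
        return parents.previousSetBit(parentDoc - 1) + 1;
    }
}
```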

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-06-01 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15310055#comment-15310055
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

bq. This is only to show a possible direction, BitSetProducer in the join 
queries may also need to be replaced by a DocBlocksIteratorProducer.

Cool. Let's iterate on this approach in a new issue? So that this issue can 
focus on the doc values based approach.

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc during advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of jvm heap space.  Also typically due the 
> nature how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Updated] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-06-01 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7307:
--
Attachment: LUCENE_7307.patch

Forgot to add an assertion in the test.

> Add getters to PointInSetQuery and PointRangeQuery classes
> --
>
> Key: LUCENE-7307
> URL: https://issues.apache.org/jira/browse/LUCENE-7307
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE_7307.patch, LUCENE_7307.patch, LUCENE_7307.patch, 
> LUCENE_7307.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-06-01 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7307:
--
Attachment: LUCENE_7307.patch

I've updated the patch and the returned collection is now a view.
Also I changed the return type from Collection to Collection 
because the sortedPackedPoints iterator already makes a copy and returns that 
as BytesRef.
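
As an illustration of what "a view" means here, a minimal sketch in plain Java, 
not the code from the patch. The class `PackedPointsView`, the method 
`asCollection`, the flat fixed-width `byte[]` layout and the `Collection<byte[]>` 
element type are all assumptions made for the example. The collection holds no 
copy of the points; each element is decoded (copied out) only while iterating.

```java
import java.util.AbstractCollection;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative only: a read-only collection view over fixed-width packed points. */
final class PackedPointsView {

  /**
   * Exposes {@code packed} as a collection of points without copying it up front.
   * Hypothetical layout: points are stored back to back, {@code bytesPerPoint} bytes each.
   */
  static Collection<byte[]> asCollection(byte[] packed, int bytesPerPoint) {
    final int size = packed.length / bytesPerPoint;
    return new AbstractCollection<byte[]>() {
      @Override
      public Iterator<byte[]> iterator() {
        return new Iterator<byte[]>() {
          private int next = 0;

          @Override
          public boolean hasNext() {
            return next < size;
          }

          @Override
          public byte[] next() {
            if (!hasNext()) {
              throw new NoSuchElementException();
            }
            int from = next * bytesPerPoint;
            next++;
            // The copy happens here, so callers cannot modify the backing array.
            return Arrays.copyOfRange(packed, from, from + bytesPerPoint);
          }
        };
      }

      @Override
      public int size() {
        return size;
      }
    };
  }
}
```

Because `AbstractCollection` does not implement `add`, and the iterator does not 
override `remove`, the view is read-only: both throw 
`UnsupportedOperationException`.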

> Add getters to PointInSetQuery and PointRangeQuery classes
> --
>
> Key: LUCENE-7307
> URL: https://issues.apache.org/jira/browse/LUCENE-7307
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE_7307.patch, LUCENE_7307.patch, LUCENE_7307.patch
>
>





[jira] [Updated] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-06-01 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7307:
--
Attachment: LUCENE_7307.patch

bq. Also I don't think sortedPackedPointsHashCode should have a getter,

Oops, totally agree.

I updated the patch. The low/high points are cloned and the sortedPackedPoints 
are exposed as a Collection.

> Add getters to PointInSetQuery and PointRangeQuery classes
> --
>
> Key: LUCENE-7307
> URL: https://issues.apache.org/jira/browse/LUCENE-7307
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE_7307.patch, LUCENE_7307.patch
>
>





[jira] [Updated] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-05-31 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7307:
--
Attachment: LUCENE_7307.patch

> Add getters to PointInSetQuery and PointRangeQuery classes
> --
>
> Key: LUCENE-7307
> URL: https://issues.apache.org/jira/browse/LUCENE-7307
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Trivial
> Attachments: LUCENE_7307.patch
>
>





[jira] [Created] (LUCENE-7307) Add getters to PointInSetQuery and PointRangeQuery classes

2016-05-31 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7307:
-

 Summary: Add getters to PointInSetQuery and PointRangeQuery classes
 Key: LUCENE-7307
 URL: https://issues.apache.org/jira/browse/LUCENE-7307
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Martijn van Groningen
Priority: Trivial
 Attachments: LUCENE_7307.patch






[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-05-31 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307395#comment-15307395
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

Having different block join implementations with different trade-offs around is 
good. If EliasFanoDocIdSet can extend from `BitSet` then I think it would be a 
nice addition to the join module, so that `ToParentBlockJoinQuery` and friends 
can use it as `parentsFilter`. This way the block join that exists today can be 
improved in certain scenarios (I think that largely depends on how dense this 
parentsFilter is; typically it tends to be on the dense side).

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE-5092-20140313.patch, LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, typically due to 
> the nature of how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-05-30 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306555#comment-15306555
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

bq. I still have an EliasFanoDocIdSet that could be used for block joins, see 
LUCENE-5092.

I'm not familiar with EliasFanoDocIdSet, but can that implementation iterate 
backwards? The link to the pull request mentioned in that issue gives a 404, and 
from the patch in LUCENE-6484 it doesn't seem that this is supported.

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, typically due to 
> the nature of how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-05-27 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304145#comment-15304145
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

bq. Does this approach work out to less than one bit per doc? 

Unfortunately it is more than that. But with the current block join 
implementation the memory cost increases (it requires extra bit sets) when there 
are multiple levels of parent-child relations, while with this approach the 
memory cost remains the same (it just needs one numeric doc values field to 
encode the multiple layers of document blocks). 

bq. our doc values compression isn't THAT good yet

Maybe if doc values become iterator based, then I guess with delta encoding 
we could get closer to 1 bit per doc?
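
To put rough numbers on the bit set side of this trade-off, a back-of-the-envelope 
sketch. It assumes one bit per document stored in 64-bit words (which is how 
`FixedBitSet` lays out its bits) and one such bit set per level of parents; it 
ignores object headers and any caching overhead. The class and method names are 
made up for the example.

```java
/** Illustrative only: approximate heap cost of parent bit sets. */
final class ParentBitSetCost {

  /** Bytes needed for one bit per document, rounded up to whole 64-bit words. */
  static long bytesPerBitSet(int maxDoc) {
    long words = ((long) maxDoc + 63) / 64;
    return words * Long.BYTES;
  }

  /** Assumes one parent bit set per level of parent-child relations. */
  static long bytesForLevels(int maxDoc, int parentLevels) {
    return parentLevels * bytesPerBitSet(maxDoc);
  }
}
```

Under these assumptions, 100 million documents cost 12,500,000 bytes per bit set, 
so three levels of nesting cost 37,500,000 bytes, whereas the doc values approach 
keeps a single field regardless of the number of levels.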



> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, typically due to 
> the nature of how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-05-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302697#comment-15302697
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

bq. If we switched block joins to use numeric doc values, I am wondering if we 
would ever need to read doc values in reverse order? 

Yes, in this patch, but I think the logic can be changed so that at least doc 
values don't need to be read in reverse. Currently there is one offset field 
holding both the offset to the parent for child docs and the offset to the first 
child for parents. This can be split up into two fields, so that doc values 
never have to be read in reverse.

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, typically due to 
> the nature of how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Commented] (LUCENE-7304) Doc values based block join implementation

2016-05-26 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302569#comment-15302569
 ] 

Martijn van Groningen commented on LUCENE-7304:
---

bq. I wonder... instead couldn't we get a DocIdSetIterator of parent docs and 
kind of intersect it with the child DISI?

I wondered that a while ago too, but we can't go backwards with 
`DocIdSetIterator`, and that is what the advance method of the block join query 
('parentBits.prevSetBit(parentTarget-1)') requires in order to figure out where 
the first child starts for 'parentTarget'.
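
A minimal sketch of that backwards lookup, using `java.util.BitSet` as a stand-in 
for Lucene's bit set (its `previousSetBit` plays the role of `prevSetBit` here). 
The class and method names are made up for the example. It relies on the block 
join layout in which the children of a block come first and the parent is the 
last document of the block.

```java
import java.util.BitSet;

/** Illustrative only: finding where a parent's children start via a backwards bit lookup. */
final class PrevParentLookup {

  /**
   * Returns the docid of the first child of {@code parentTarget}.
   * If the parent has no children, the result equals {@code parentTarget}.
   */
  static int firstChildOf(BitSet parentBits, int parentTarget) {
    // previousSetBit returns -1 when there is no earlier parent (and for index -1),
    // in which case the block starts at docid 0.
    int previousParent = parentBits.previousSetBit(parentTarget - 1);
    return previousParent + 1;
  }
}
```

With parents at docids 3 and 6, the children of parent 6 start at docid 4 and the 
children of parent 3 start at docid 0. A forward-only iterator has no equivalent 
of the `previousSetBit` step.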

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, typically due to 
> the nature of how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

2016-05-26 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen updated LUCENE-7304:
--
Attachment: LUCENE_7304.patch

Attached a working version of a doc values based block join query. 
The app storing docs is responsible for adding the numeric doc values field 
with the right offsets.

> Doc values based block join implementation
> --
>
> Key: LUCENE-7304
> URL: https://issues.apache.org/jira/browse/LUCENE-7304
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Martijn van Groningen
>Priority: Minor
> Attachments: LUCENE_7304.patch
>
>
> At query time the block join relies on a bitset for finding the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, typically due to 
> the nature of how these bitsets are set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage by a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is apart from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine if a doc is part of a block of documents and perhaps this 
> also helps index time sorting?




[jira] [Created] (LUCENE-7304) Doc values based block join implementation

2016-05-26 Thread Martijn van Groningen (JIRA)

Martijn van Groningen created LUCENE-7304:
-

 Summary: Doc values based block join implementation
 Key: LUCENE-7304
 URL: https://issues.apache.org/jira/browse/LUCENE-7304
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Martijn van Groningen
Priority: Minor


At query time the block join relies on a bitset for finding the previous parent 
doc while advancing the doc id iterator. On large indices these bitsets can 
consume large amounts of JVM heap space. Also, typically due to the nature of 
how these bitsets are set, the 'FixedBitSet' implementation is used.

The idea I had was to replace the bitset usage by a numeric doc values field 
that stores offsets. Each child doc stores how many docids it is from its 
parent doc and each parent stores how many docids it is apart from its first 
child. At query time this information can be used to perform the block join.

I think another benefit of this approach is that external tools can now easily 
determine if a doc is part of a block of documents and perhaps this also helps 
index time sorting?
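
A minimal sketch of the offset idea, not the code from the patch. It assumes the 
usual block layout (children first, parent last) and one offset value per 
document. How a single field tells parents from children is not spelled out 
above, so the sign convention used here (positive for children, negative for 
parents) is purely an assumption of the example, as are all class and method 
names.

```java
/** Illustrative only: per-document offsets for blocks of children followed by their parent. */
final class BlockOffsets {

  /**
   * Returns one offset per document of a block with {@code numChildren} children
   * followed by a single parent.
   * Assumed encoding: a child stores the (positive) number of docids to its parent,
   * the parent stores the (negative) number of docids back to its first child.
   */
  static long[] offsetsForBlock(int numChildren) {
    long[] offsets = new long[numChildren + 1];
    for (int i = 0; i < numChildren; i++) {
      offsets[i] = numChildren - i; // distance forward to the parent
    }
    offsets[numChildren] = -numChildren; // distance back to the first child
    return offsets;
  }

  /** Parent docid of any document; a parent resolves to itself. */
  static int parentOf(int docId, long[] offsetByDocId) {
    long offset = offsetByDocId[docId];
    return offset > 0 ? docId + (int) offset : docId;
  }

  /** First child docid of a parent; a childless parent (offset 0) resolves to itself. */
  static int firstChildOf(int parentDocId, long[] offsetByDocId) {
    return parentDocId + (int) offsetByDocId[parentDocId];
  }
}
```

For a block of three children, `offsetsForBlock(3)` yields `3, 2, 1, -3`. In an 
index these values would be written to a numeric doc values field at index time, 
and the join would need no bit set at query time because each document carries 
the location of its block boundaries.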




[jira] [Resolved] (LUCENE-7206) nest child query explain into ToParentBlockJoinQuery.BlockJoinScorer.explain(int)

2016-05-25 Thread Martijn van Groningen (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martijn van Groningen resolved LUCENE-7206.
---
   Resolution: Fixed
Fix Version/s: master (7.0)
   6.1

> nest child query explain into 
> ToParentBlockJoinQuery.BlockJoinScorer.explain(int)
> -
>
> Key: LUCENE-7206
> URL: https://issues.apache.org/jira/browse/LUCENE-7206
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Affects Versions: 6.0
>Reporter: Mikhail Khludnev
>  Labels: newbie, newdev
> Fix For: 6.1, master (7.0)
>
> Attachments: LUCENE-7206-one-child-with-tests.patch, 
> LUCENE-7206-test.patch, LUCENE-7206.diff
>
>
> Currently a to-parent query match is explained with {{Score based on child 
> doc range from .. to .. }}, which is quite useless. 
> It's proposed to nest the child query match explanation from the first 
> matching child document into the parent explain. 
> WDYT?




[jira] [Commented] (LUCENE-7206) nest child query explain into ToParentBlockJoinQuery.BlockJoinScorer.explain(int)

2016-05-25 Thread Martijn van Groningen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15299976#comment-15299976
 ] 

Martijn van Groningen commented on LUCENE-7206:
---

Ilya: Thanks! This looks good. I'll push this shortly.

> nest child query explain into 
> ToParentBlockJoinQuery.BlockJoinScorer.explain(int)
> -
>
> Key: LUCENE-7206
> URL: https://issues.apache.org/jira/browse/LUCENE-7206
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Affects Versions: 6.0
>Reporter: Mikhail Khludnev
>  Labels: newbie, newdev
> Attachments: LUCENE-7206-one-child-with-tests.patch, 
> LUCENE-7206-test.patch, LUCENE-7206.diff
>
>
> Currently a to-parent query match is explained with {{Score based on child 
> doc range from .. to .. }}, which is quite useless. 
> It's proposed to nest the child query match explanation from the first 
> matching child document into the parent explain. 
> WDYT?


