Store arrays in DocValues and keep the original order

2022-06-28 Thread linfeng lu
Hi~

We are trying to build an OLAP database based on lucene, and we heavily use 
lucene's DocValues (as our column store).

We try to use DocValues to store array-typed fields. For example, if we want 
to store field1 and field2 from this json document into DocValues 
respectively, SORTED_NUMERIC and SORTED_SET seem to be our only options.

{
"field1": [ 3, 1, 1, 2 ],
"field2": [ "c", "a", "a", "b" ]
}


When we store field1 in SORTED_NUMERIC and field2 in SORTED_SET, we will get 
this result:

field1:

  *   origin: [3, 1, 1, 2]
  *   in SORTED_NUMERIC: [1, 1, 2, 3]

field2:

  *   origin: ["c", "a", "a", "b"]
  *   in SORTED_SET: ords [0, 1, 2], terms ["a", "b", "c"]

The original ordering relationship of the elements in the array is lost.
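
To illustrate, the transformations above can be mimicked with plain-JDK collections (a behavioral sketch only; Lucene's actual doc-values encodings are more involved):

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class DocValuesOrderDemo {
    public static void main(String[] args) {
        // SORTED_NUMERIC: per-document values come back sorted; duplicates kept.
        long[] field1 = {3, 1, 1, 2};
        long[] asSortedNumeric = field1.clone();
        Arrays.sort(asSortedNumeric);
        System.out.println(Arrays.toString(asSortedNumeric)); // [1, 1, 2, 3]

        // SORTED_SET: terms are deduplicated and sorted; a document keeps only
        // the set of term ordinals, so both duplicates and order are lost.
        List<String> field2 = List.of("c", "a", "a", "b");
        TreeSet<String> terms = new TreeSet<>(field2);
        System.out.println(terms); // [a, b, c]
    }
}
```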

We're guessing that lucene's DocValues are designed primarily for sorting and 
aggregation, so the original order of elements may not matter.

But in our use case, it is important to keep the original order of the 
elements in the array (we allow users to access the elements of an array using 
the subscript operator).

We wonder whether lucene has plans to add a new DocValues type that can store 
arrays while keeping the original order of their elements?

Thanks!


Re: Store arrays in DocValues and keep the original order

2022-06-28 Thread Shai Erera
Depending on what you use the field for, you can use BinaryDocValuesField
which encodes a byte[] and lets you store the data however you want. But
how are you using these fields later at search time?
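
As a sketch of that approach: an order-preserving encoding for a numeric array can be as simple as a length prefix followed by the values. The class name and byte layout below are invented for illustration (this is not a Lucene format); the resulting byte[] is what would be handed to a BinaryDocValuesField and decoded the same way at read time.

```java
import java.nio.ByteBuffer;

public class ArrayCodec {
    // Encode: a 4-byte element count, then each value as 8 bytes,
    // in the array's original order.
    public static byte[] encode(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * values.length);
        buf.putInt(values.length);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    // Decode: read the count, then the values back in the same order.
    public static long[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] values = new long[buf.getInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getLong();
        }
        return values;
    }

    public static void main(String[] args) {
        long[] original = {3, 1, 1, 2};
        long[] roundTrip = decode(encode(original));
        System.out.println(java.util.Arrays.toString(roundTrip)); // [3, 1, 1, 2]
    }
}
```

With a fixed-width layout like this, subscript access need not even decode the whole array: element i sits at byte offset 4 + 8 * i in the stored byte[].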

On Tue, Jun 28, 2022 at 3:46 PM linfeng lu wrote:

> Hi~
>
> We are trying to build an OLAP database based on lucene, and we heavily
> use lucene's *DocValues* (as our column store).
>
> *We try to use DocValues to store the array type field. *For example, if
> we want to store the *field1* and *feild2* in this json document into
> *DocValues* respectively, SORTED_NUMERIC and SORTED_SET seem to be our
> only option.
>
> *{*
> *"field1": [ 3, 1, 1, 2 ], *
> *"field2": [ "c", "a", "a", "b" ] *
> *}*
>
>
> When we store *field1* in SORTED_NUMERIC and *field2* in SORTED_SET, we
> will get this result:
>
> *[image: Community Verified icon]*
>
> field1:
>
>- origin: [3, 1, 1, 2]
>- in SORTED_NUMERIC: [1, 1, 2, 3]
>
> field2:
>
>- origin: [”c”, “a”, “a”, “b” ]
>- in SORTED_SET: ords [0, 1, 2] terms [”a”, “b”, “c”]
>
>
> The original ordering relationship of the elements in the array is lost.
>
> We're guessing that lucene's DocValues are designed primarily for sorting
> and aggregation, so the original order of elements may not matter.
>
> But in our usage scene, it is important to keep the original order of the
> elements in the array (we allow user to access the elements in the array
> using the subscript operator).
>
> We wonder if lucene has plans to add new types of DocValues that can store
> arrays and keep the original order of elements in the array?
>
> Thanks!
>


Re: Finding out which fields matched the query

2022-06-28 Thread Alan Woodward
I think it depends on what information we actually want to get here.  If it’s 
just finding which fields matched in which document, then running Matches over 
the top-k results is fine.  If you want to get some kind of aggregate data, as 
in you want to get a list of fields that matched in *any* document (or 
conversely, a list of fields that *didn’t* match - useful if you want to prune 
your schema, for example), then Matches will be too slow.  But at the same 
time, queries are designed to tell you which *documents* match efficiently, and 
they are allowed to advance their sub-queries lazily or indeed not at all if 
the result isn’t needed for scoring.  So we don’t really have any way of 
finding this kind of information via a collector that is accurate and performs 
reasonably.

It *might* be possible to rework Matches so that they act more like an iterator 
and maintain their state within a segment, but there hasn’t been a pressing 
need for that so far.

> On 27 Jun 2022, at 12:46, Shai Erera wrote:
> 
> Thanks Alan, yeah I guess I was thinking about the use case I described, which 
> involves (usually) simple term queries, but you're definitely right about 
> complex boolean clauses as well as non-term queries.
> 
> I think the case for highlighter is different though? I mean you usually 
> generate highlights only for the top-K results and therefore are probably 
> less affected by whether the matches() API is slower than a Collector. And if 
> you invoke the API for every document in the index, it might be much slower 
> (depending on the index size) than the Collector.
> 
> Maybe a hybrid approach which runs the query and caches the docs in a 
> DocIdSet (like FacetsCollector does) and then invokes the matches() API only 
> on those hits, will let you enjoy the best of both worlds? Assuming though 
> that the number of matching documents is not huge.
> 
> So it seems there are several options and one should choose based on their 
> use case. Do you see an advantage for Lucene to offer a Collector for this 
> use case? Or should we tell users to use the matches API?
> 
> Shai
> 
> On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss wrote:
> A side note - I've been using a highlighter based on matches API for
> quite some time now and it's been fantastic. Very precise and handles
> non-trivial queries (interval queries) very well.
> 
> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
> 
> 
> Dawid
> 
> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward wrote:
> >
> > Your approach is almost certainly more efficient, but it might give you 
> > false matches in some cases - for example, if you have a complex query with 
> > many nested MUST and SHOULD clauses, you can have a leaf TermScorer that is 
> > positioned on the correct document, but which is part of a clause that 
> > doesn’t actually match.  It also only works for term queries, so it won’t 
> > match phrases or span/interval groups.  And Matches will work on points or 
> > docvalues queries as well.  The reason I added Matches in the first place 
> > was precisely to handle these weird corner cases - I had written 
> > highlighters which more or less did the same thing you describe with a 
> > Collector and the Scorable tree, and I would occasionally get bad 
> > highlights back.
> >
> > On 27 Jun 2022, at 10:51, Shai Erera wrote:
> >
> > Out of curiosity and for education purposes, is the Collector approach I 
> > proposed wrong/inefficient? Or less efficient than the matches() API?
> >
> > I'm thinking, if you want to both match/rank documents and as a side effect 
> > know which fields matched, the Collector will perform better than 
> > Weight.matches(), but I could be wrong.
> >
> > Shai
> >
> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss wrote:
> >>
> >> The matches API is awesome. Use it. You can also get a rough glimpse
> >> into a superset of fields potentially matching the query via:
> >>
> >> Set<String> affectedFields = new HashSet<>();
> >> query.visit(
> >>     new QueryVisitor() {
> >>       @Override
> >>       public boolean acceptField(String field) {
> >>         affectedFields.add(field);
> >>         return false;
> >>       }
> >>     });
> >>
> >> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
> >> 
> >>
> >> I'd go with the Matches API though.
> >>
> >> Dawid
> >>
> >> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward wrote:
> >> >
> >> > The Matches API will give you this information - it’s 

Re: A prototype migration tool Jira to GitHub

2022-06-28 Thread Tomoko Uchida
I finished the second prototype. With a few exceptions, almost all existing
issues were successfully migrated into the test repo. You can browse/search
them.
https://github.com/mocobeta/sandbox-lucene-10557/issues

Some limitations in the first prototype have been addressed. For example,
we can preserve the original timestamp of the issues/comments.
I could list the improvements and current limitations here, but please try it
out yourself; individual issues can be looked up by their Jira issue numbers.
Note that "attachments" are still not ported. We've found workarounds, so this
will be addressed in the next iteration.

I don't think we have reached a conclusion yet, but I fully recognize there are
strong requests for an atomic switch to GitHub, and I haven't seen any
objections to that so far - so I'll continue to work on improving the
migration quality.
I'll finish playing around with prototyping; if there are further iterations,
they will be rehearsals for the actual migration.


Tomoko


On Mon, Jun 27, 2022 at 10:27 Tomoko Uchida wrote:

> > It looks like the GitHub Danger Zone can transfer a repository?
>
> "Transferring a repository" creates another repository different from
> apache/lucene. It'd make the migration process easy though, is it our
> intention to have an external repository for old issues?
>
> Tomoko
>
>
> On Mon, Jun 27, 2022 at 8:24 Michael McCandless wrote:
>
>> It looks like the GitHub Danger Zone can transfer a repository?
>>
>> It's not clear if it can go from Personal -> Organization though.  I see
>> Personal -> Personal and Organization -> Organization.
>>
>>
>> https://docs.github.com/en/repositories/creating-and-managing-repositories/transferring-a-repository
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sun, Jun 26, 2022 at 6:40 PM Tomoko Uchida <
>> tomoko.uchida.1...@gmail.com> wrote:
>>
>>>
>>>
>>>
>>>

 On Mon, Jun 27, 2022 at 5:16 Michael Sokolov wrote:

> as for this access control/script monitoring problem, I wonder whether
> we could import all the issues into a new github repo owned by
> whomever is running the script, and then transfer from there to the
> lucene repo? It would be an extra step involving another script (or
> something), but maybe(?) that one could be much simpler since it is
> github->github?? If this works out, we could have full control of the
> first step and only hand off to infra the simpler copying job.
>
>
 I don't see an API or tool that transfers all issues from one repo to
 another.

>>>
>>> To be exact, I don't see an API or tool that transfers all issues from
>>> one repo to another while keeping cross-issue links.
>>> If we want to preserve cross-issue links, there's no difference between
>>> "Jira to GitHub" and "GitHub to GitHub".
>>>
>>>

> On Sat, Jun 25, 2022 at 7:53 AM Tomoko Uchida wrote:
> >
> > I may have to share another practical consideration on the migration
> that I haven't mentioned yet.
> >
> > We are not allowed to have admin access to the lucene GitHub repo,
> so we can't run the import job(s) ourselves.
> > We'll have to make a tool with clear instructions for the migration
> and pass it to the infra team, then support them via Jira (or Slack?) if
> there are any problems.
> > See https://issues.apache.org/jira/browse/INFRA-20118
> >
> > We can do some preparation locally (e.g. dump Jira issues and
> convert them to a format importable into GitHub), but the actual first and
> second pass imports will be done by the infra team.
> > I think I myself won't be able to have close contact with the infra
> team if the migration operation is too complicated due to the time
> difference and my communication ability - I'm not good at real-time
> conversation in English.
> > So if we need a complex migration plan, I think I'll have to find
> someone who is willing to take over the job.
> >
> >
> >
> > On Sat, Jun 25, 2022 at 19:19 Tomoko Uchida wrote:
> >>
> >> Hi Dawid,
> >>
> >> > Emm.. sorry for being slow - what is it that you want me to do?
> :) Unwatch->Ignore?
> >>
> >> I'm sorry for being ambiguous. Could you set your notification
> setting on the repository to "Participating and @mentions"?
> >> In the testing of migration scripts, I will import many fake issues
> where your account is linked as the original reporter/author with real
> mentions, like this example.
> >> https://github.com/mocobeta/migration-test-1/issues/111
> >> If they do not disturb your inbox with spam notifications then the
> test is successful.
> >>
> >> With regard to attachments:
> >>
> >> > 1) create a (separate?) git repository or branch with a separate
> root in the lucene repository with all jira attachments upon importing 
> them.
> >> > 2) there are about 7k issues with attachments in Jira. We can
> split them into 25-issue batches and