Re: File search progress: database review and question on triggers

2020-10-21 Thread Pierre Neidhardt
Ludovic Courtès  writes:

> A client-side approach (not involving guix-daemon) would be more readily
> usable, though some of the questions above remain open.

I'd also prefer to stick to the client side.  But how can I trigger an
event when a package gets built?

Maybe we could hook into specific commands like `guix build', collect
the list of builds that are going to be done, and upon completion, fill
the database with the build results.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/




Re: File search progress: database review and question on triggers

2020-10-21 Thread Ludovic Courtès
Pierre Neidhardt  skribis:

> Ludovic Courtès  writes:
>
>> It first tries ‘query-path-info’, which succeeds if the store item is
>> available and contains info about its size, references, and so on.
>>
>> When ‘query-path-info’ fails, it falls back to
>> ‘query-substitutable-path-info’, which allows it to obtain the same
>> information for substitutes.  Thus, it doesn’t need to have the store
>> item available locally.
>
> So we could do the same with `guix filesearch`:
>
> - First try the entry in the database.
>
> - If not there, try query-path-info and if it succeeds, populate the
>   database.
>
> - If query-path-info does not succeed, try our new
>   query-substitutable-filesearch-info.
>
> That would work, no?

Sure, but it’s a big change on the daemon side (you need this new RPC, a
new database, a way to update this database, presumably orthogonal to
substitutes (?), etc.)

A client-side approach (not involving guix-daemon) would be more readily
usable, though some of the questions above remain open.

Thanks,
Ludo’.



Re: File search progress: database review and question on triggers

2020-10-17 Thread Pierre Neidhardt
Pierre Neidhardt  writes:

> So we could do the same with `guix filesearch`:
>
> - First try the entry in the database.
>
> - If not there, try query-path-info and if it succeeds, populate the
>   database.
>
> - If query-path-info does not succeed, try our new
>   query-substitutable-filesearch-info.
>
> That would work, no?

Oops, I must have been distracted when writing this: of course this
cannot work "just in time", because `guix filesearch' does not take
the local package as an argument (unlike `guix size').

The local package must be indexed right after it's built; there is no
way around it.

-- 
Pierre Neidhardt
https://ambrevar.xyz/




Re: File search progress: database review and question on triggers

2020-10-16 Thread Ludovic Courtès
Pierre Neidhardt  skribis:

> Ludovic Courtès  writes:
>
>>> Question: How do I hook onto =guix build=?
>>
>> You would need a build-completion hook in the daemon, which doesn’t
>> exist (yet!).  Note also that at this level we only see derivations, not
>> packages.
>
> Hmm... Can you explain to me how =guix size= works with local builds?
> Does it compute the size on demand, or is it stored somewhere?

It first tries ‘query-path-info’, which succeeds if the store item is
available and contains info about its size, references, and so on.

When ‘query-path-info’ fails, it falls back to
‘query-substitutable-path-info’, which allows it to obtain the same
information for substitutes.  Thus, it doesn’t need to have the store
item available locally.
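The lookup order described above can be sketched as follows.  This is an illustration only: the class, dictionaries, and store paths below are invented stand-ins mirroring the two Guile RPCs named in this thread, not the daemon's actual interface.

```python
class ToyStore:
    """Toy model of the daemon: locally built items plus substitute metadata."""
    def __init__(self, local, substitutes):
        self.local = local              # items available in the local store
        self.substitutes = substitutes  # metadata from the substitute server

    def query_path_info(self, item):
        # Succeeds only if the store item is available locally.
        return self.local[item]

    def query_substitutable_path_info(self, item):
        # Same information, obtained for substitutes instead.
        return self.substitutes[item]


def path_info(store, item):
    """First try the local store item; fall back to substitute metadata."""
    try:
        return store.query_path_info(item)
    except KeyError:
        return store.query_substitutable_path_info(item)


store = ToyStore(local={"hello": {"size": 125}},
                 substitutes={"magit": {"size": 8130}})
path_info(store, "hello")   # local hit
path_info(store, "magit")   # not local, answered from substitutes
```

The point of the fallback is that the caller never needs the store item on disk; it only degrades to substitute metadata transparently.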

HTH!

Ludo’.



Re: File search progress: database review and question on triggers

2020-10-14 Thread Pierre Neidhardt
Ludovic Courtès  writes:

>> Question: How do I hook onto =guix build=?
>
> You would need a build-completion hook in the daemon, which doesn’t
> exist (yet!).  Note also that at this level we only see derivations, not
> packages.

Hmm... Can you explain to me how =guix size= works with local builds?
Does it compute the size on demand, or is it stored somewhere?

-- 
Pierre Neidhardt
https://ambrevar.xyz/




Re: File search progress: database review and question on triggers

2020-10-13 Thread Ludovic Courtès
Pierre Neidhardt  skribis:

>> “Something” needs to build the file-to-package database (which is what
>> you’re working on), and then there needs to be a way for users to fetch
>> that database.  This is all orthogonal to substitutes, as I see it,
>> which is why I think we need to think about integrating it maybe with
>> ‘guix publish’ on the server side and (guix channels) on the client
>> side.
>
> If the database is filled on =guix build=, then the substitute server
> would automatically have a filesearch database.
>
> Question: How do I hook onto =guix build=?

You would need a build-completion hook in the daemon, which doesn’t
exist (yet!).  Note also that at this level we only see derivations, not
packages.

>> Ah got it; having a database known to correspond to a specific commit is
>> even better.
>>
>> However, note that it could take time for the server to provide the
>> complete database for a commit (the time for as many packages as
>> possible to be built), so clients may want to refresh it anyway, or even
>> perhaps to use an older database.
>
> Indeed, and that's why I think we need to include a timestamp: if the
> client's database timestamp is older than that of the substitute server
> database, then we re-fetch it.

Yeah, maybe that can work.
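A minimal sketch of that timestamp rule (hypothetical: how the server's timestamp would be obtained and what fetching actually does are left open in the thread; here the timestamp is simply passed in):

```python
import os

def maybe_refresh(local_db, server_timestamp, fetch):
    """Re-fetch LOCAL_DB when the server's database is newer.

    LOCAL_DB is a path; SERVER_TIMESTAMP is a Unix time; FETCH is a
    callback that downloads and replaces the local database.
    """
    try:
        local_timestamp = os.path.getmtime(local_db)
    except FileNotFoundError:
        local_timestamp = 0.0        # no local copy yet: always fetch
    if server_timestamp > local_timestamp:
        fetch(local_db)              # download and replace the database
        return True
    return False
```

With no local database, any server timestamp triggers a fetch; otherwise the two modification times are simply compared.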

Thanks,
Ludo’.



Re: File search progress: database review and question on triggers

2020-10-13 Thread Ludovic Courtès
Pierre Neidhardt  skribis:

> Ludovic Courtès  writes:
>
>> I would lean towards keeping it separate, so that it’s an optional
>> feature (given that it relies on downloading an external database).
>
> I was leaning towards downloading the database with "guix pull", so that
> the "filesearch" subcommand would not download anything.

Like substitutes, it should be an optional feature IMO, which means we
need to keep clear boundaries.

There’s also the question of whether we should have a system-wide
database so that each user doesn’t have to download it, especially if
it’s relatively big.

Apologies for throwing more questions than answers into the mix.  :-)

Ludo’.



Re: File search progress: database review and question on triggers

2020-10-13 Thread Ludovic Courtès
Hi Pierre,

Pierre Neidhardt  skribis:

> Ludovic Courtès  writes:

[...]

>> It would be nice to see whether/how this could be integrated with
>> third-party channels.  Of course it’s not a priority, but while
>> designing this feature, we should keep in mind that we might want
>> third-party channel authors to be able to offer such a database for
>> their packages.
>
> Wouldn't it work automatically for any substitute server?

“Something” needs to build the file-to-package database (which is what
you’re working on), and then there needs to be a way for users to fetch
that database.  This is all orthogonal to substitutes, as I see it,
which is why I think we need to think about integrating it maybe with
‘guix publish’ on the server side and (guix channels) on the client
side.

>>> - Find a way to garbage-collect the database(s).  My intuition is that
>>>   we should have 1 database per Guix checkout and when we `guix gc` a
>>>   Guix checkout we collect the corresponding database.
>>
>> If we download a fresh database every time, we might as well simply
>> overwrite the one we have?
>
> But then we would miss the database when switching Guix generation.

Ah got it; having a database known to correspond to a specific commit is
even better.

However, note that it could take time for the server to provide the
complete database for a commit (the time for as many packages as
possible to be built), so clients may want to refresh it anyway, or even
perhaps to use an older database.

Thanks,
Ludo’.



Re: File search progress: database review and question on triggers

2020-10-12 Thread zimoun
On Mon, 12 Oct 2020 at 12:20, Ludovic Courtès  wrote:

>> - Textual database: slow and not lighter than SQLite.  Not worth it I 
>> believe.
>>
>> - SQLite without full-text search: fast, supports classic patterns
>>   (e.g. "foo*bar") but does not support word permutations.
>>
>> - SQLite with full-text search: fast, supports word permutations but
>>   does not support suffix-matching (e.g. "bar" won't match "foobar").
>>   Size is about the same as without full-text search.
>>
>> - Include synopsis and descriptions.  Maybe we should include all fields
>>   that are searched by `guix search`.  This incurs a cost on the
>>   database size but it would fix the `guix search` speed issue.  Size
>>   increases by some 10 MiB.
>
> Oh so this is going beyond file search, right?
>
> Perhaps it would make sense to focus on file search only as a first
> step, and see what can be done with synopses/descriptions (like Arun and
> zimoun did before) later, separately?

Well, the first patch set that Arun sent for improving “guix search” was
the introduction of a SQLite database, replacing the current
’package.cache’.  And I quote your wise advice:

I would rather keep the current package cache as-is instead of
inserting sqlite in here.  I don’t expect it to bring much
compared performance-wise to the current simple cache
(especially if we look at load time), and it does increase
complexity quite a bit.

However, using sqlite for keyword search as you initially
proposed on guix-devel does sound like a great idea to me.

Message-ID: <87sgjhx92g@gnu.org>


Therefore, if Pierre is going to introduce a SQL database where the
addition of the synopses/descriptions is cheap, it seems a good idea to
use it, doesn’t it?  Keeping the ’package.cache’ as-is.  And in parallel,
“we” can try to use this WIP branch to improve the speed of “guix
search” (by “we”, I mean that I plan to work on it).

BTW, it would be really easy to remove these 2 extra fields if they
turn out not to be conclusive for search, since they appear only in the
function ’add-files’:

--8<---cut here---start->8---
(with-statement
    db
    (string-append "insert into Info (name, synopsis, description, package)"
                   " values (:name, :synopsis, :description, :id)")
    stmt
  (sqlite-bind-arguments stmt
                         #:name name
                         #:synopsis synopsis
                         #:description description
                         #:id id)
  ...)
--8<---cut here---end--->8---

and ’add-files’ is used only once, by ’persist-package-files’.



> It would be nice to see whether/how this could be integrated with
> third-party channels.  Of course it’s not a priority, but while
> designing this feature, we should keep in mind that we might want
> third-party channel authors to be able to offer such a database for
> their packages.

If the third-party channels also provide substitutes, then it would be
part of the substitutes, or easy to build from the substitute metadata.



>> - Find a way to garbage-collect the database(s).  My intuition is that
>>   we should have 1 database per Guix checkout and when we `guix gc` a
>>   Guix checkout we collect the corresponding database.
>
> If we download a fresh database every time, we might as well simply
> overwrite the one we have?

But you do not want to download it again if you roll back, for example.
From my point of view, it should be the same mechanism as
’package.cache’.


Cheers,
simon



Re: File search progress: database review and question on triggers

2020-10-12 Thread Ludovic Courtès
Hi,

Pierre Neidhardt  skribis:

> Of course, `guix filesearch` hasn't been implemented yet ;)
>
> We still need to decide whether we want to make it part of `guix search'
> or define a separate command.

I would lean towards keeping it separate, so that it’s an optional
feature (given that it relies on downloading an external database).

Thanks,
Ludo’.



Re: File search progress: database review and question on triggers

2020-10-12 Thread Ludovic Courtès
Hi!

Pierre Neidhardt  skribis:

>> Could you post a summary of what you have done, what’s left to do, and
>> how you’d like to integrate it?  (If you’ve already done it, my
>> apologies, but you can resend a link.  :-))
>
> What I've done: mostly a database benchmark.
>
> - Textual database: slow and not lighter than SQLite.  Not worth it I believe.
>
> - SQLite without full-text search: fast, supports classic patterns
>   (e.g. "foo*bar") but does not support word permutations.
>
> - SQLite with full-text search: fast, supports word permutations but
>   does not support suffix-matching (e.g. "bar" won't match "foobar").
>   Size is about the same as without full-text search.
>
> - Include synopsis and descriptions.  Maybe we should include all fields
>   that are searched by `guix search`.  This incurs a cost on the
>   database size but it would fix the `guix search` speed issue.  Size
>   increases by some 10 MiB.

Oh so this is going beyond file search, right?

Perhaps it would make sense to focus on file search only as a first
step, and see what can be done with synopses/descriptions (like Arun and
zimoun did before) later, separately?

> What's left to do:
>
> - Populate the database on demand, either after a `guix build` or from a
>   `guix filesearch...`.  This is important so that `guix filesearch`
>   works on packages built locally.  If `guix build`, I need help to know
>   where to plug it in.
>
> - Adapt Cuirass so that it builds its file database.
>   I need pointers to get started here.
>
> - Sync the databases from the substitute server to the client when
>   running `guix filesearch`.  For this I suggest we send the compressed
>   database corresponding to a guix generation over the network (around
>   10 MiB).  Not sure sending just the delta is worth it.

It would be nice to see whether/how this could be integrated with
third-party channels.  Of course it’s not a priority, but while
designing this feature, we should keep in mind that we might want
third-party channel authors to be able to offer such a database for
their packages.

> - Find a way to garbage-collect the database(s).  My intuition is that
>   we should have 1 database per Guix checkout and when we `guix gc` a
>   Guix checkout we collect the corresponding database.

If we download a fresh database every time, we might as well simply
overwrite the one we have?

Thanks,
Ludo’.



Re: File search progress: database review and question on triggers

2020-10-11 Thread zimoun
On Sun, 11 Oct 2020 at 16:25, Pierre Neidhardt  wrote:

> Maybe you misunderstood a point: the filesearch database is not a
> database of _all store items_, but only of the items that correspond to
> the packages of a given Guix generation.

Yes, that is clear to me.  I meant: “all the store items of one specific
Guix generation/revision”.  So all the output files of ~15k packages.


>> Well, I think it will be more with all the items of all the packages.
>
> No, the 8 MiB include _all the packages_ of a Guix generation.
> We never include the complete store; it would not make sense for filesearch.

[...]

> --8<---cut here---start->8---
>   The database with all package descriptions and synopses is 46 MiB and
>   compresses down to 11 MiB in zstd.
> --8<---cut here---end--->8---

Sorry, I have overlooked the code.  All is fine. :-)


I will try to adapt “guix search” to use the database instead.


Cheers,
simon



Re: File search progress: database review and question on triggers

2020-10-11 Thread zimoun
Hi Pierre,

I am trying to resume the work on "guix search" to make it faster.
That's why I am asking for the details.  :-)
With the introduction of this database, as mentioned earlier,
2 annoyances could be fixed at once.


On Sun, 11 Oct 2020 at 13:19, Pierre Neidhardt  wrote:

> > --8<---cut here---start->8---
> > echo 3 > /proc/sys/vm/drop_caches
> > time updatedb --output=/tmp/store.db --database-root=/gnu/store/
> >
> > real0m19.903s
> > user0m1.549s
> > sys 0m4.500s
>
> I don't know the size of your store nor your hardware.  Could you
> benchmark against my filesearch implementation?

30G as I reported in my previous email. ;-)


> > From my point of view, yes.  Somehow “filesearch” is a subpart of
> > “search”.  So it should share the machinery.
>
> I'll work on it.  I'll try to make the code flexible enough so that it
> can be moved to another command easily, should we decide that "search"
> is not the right fit.

UI does not matter so much at this point, I guess.  But the nice final
UI should be:

   guix search --file=


> > For example, I just did “guix pull” and “--list-generations” says from
> > f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):
> >
> >    39.9 MB will be downloaded
> >
> > plus the tiny bits before “Computing Guix derivation”.  Say 50MB max.
> >
> > Well, the “locate” database for my “/gnu/store” (~30GB) is already
> > ~50MB, and ~20MB when compressed with gzip.  And Pierre said:
> >
> >   The database with all package descriptions and synopses is 46 MiB
> >   and compresses down to 11 MiB in zstd.
>
> I should have benchmarked with Lzip, it would have been more useful.  I
> think we can get it down to approximately 8 MiB in Lzip.

Well, I think it will be larger with all the items of all the packages.
My point is: the database will be comparable in size to the bits of
"guix pull"; it is not much, but still something.


> > which is better but still something.  Well, it is not affordable to
> > fetch the database with “guix pull”, In My Humble Opinion.
>
> We could send a "diff" of the database.

This means setting things up server-side, right?  So implementing the
"diff" in "guix publish", right?  Hmm, I feel it is overcomplicated.


> For instance, if the user already has a file database for the Guix
> generation A, then guix pulls to B, the substitute server can send the
> diff between A and B.  This would probably amount to less than 1 MiB if
> the generations are not too far apart.  (Warning: actual measures needed!)

Well, what is the size of the database for a full /gnu/store/ containing
all the packages of one specific revision?  Sorry if you already provided
this information; I must have missed it.


> > Therefore, the database would be fetched at the first “guix search”
> > (assuming the point above).  But then, how could “search” know what is
> > a custom build and what is not?  Somehow, “search” would have to scan
> > the whole store to be able to update the database.
> >
> > And what happens each time I do a custom build and then run
> > “filesearch”?  The database should be updated, right?  Well, it seems
> > almost unusable.
>
> I mentioned this previously: we need to update the database on "guix
> build".  This is very fast and would be mostly transparent to the user.
> This is essentially how "guix size" behaves.

Ok.


> > The model “updatedb/locate” seems better.  The user updates “manually”
> > if required and then location is fast.
>
> "manually" is not good in my opinion.  The end-user will inevitably
> forget.  An out-of-sync database would return bad results, which is a
> big no-no for search.  On-demand database updates are ideal, I think.

The tradeoff is:
  - when is "on-demand"?  When does the database get updated?
  - searching must still be fast
  - other guix subcommands must not be slowed down


What you are proposing is:

 - when "guix search --file":
 + if the database does not exist: fetch it
 + otherwise: use it
 - after each "guix build", update the database

Right?

I am still missing the other mechanism, the one for updating the database.

(Note that the "fetch it" could be done at "guix pull" time which is
more meaningful since pull requires network access as you said.  And
the real computations for updating could be done at the first "guix
search --file" after the pull.)


> Possibly using a "diff" to shrink the download size.
>
> >  - otherwise: use this database
> >  - optionally update the database if the user wants to include new
> >  custom items.
>
> No need for the optional point I believe.

Note that since the same code is used on build farms and their store
is several TB (see recent discussion about "guix gc" on Berlin that
takes hours), the build and update of the database need some care. :-)


> >> - Find a way to garbage-collect the database(s).  My intuition is that
> >>   we should have 1 database per Guix checkout and when we `guix gc` a
> >>   Guix checkout we collect the corresponding database.
> >
> > Well, the exact same strategy as

Re: File search progress: database review and question on triggers

2020-10-11 Thread Pierre Neidhardt
Hi Zimoun,

Thanks for the feedback!

> --8<---cut here---start->8---
> echo 3 > /proc/sys/vm/drop_caches
> time updatedb --output=/tmp/store.db --database-root=/gnu/store/
>
> real0m19.903s
> user0m1.549s
> sys 0m4.500s

I don't know the size of your store nor your hardware.  Could you
benchmark against my filesearch implementation?

> And then “locate” supports regexps and it is fast enough.

But locate does not support word permutations, which is a very important
feature for filesearch in my opinion.

> The only point is that regexp is always cumbersome for me.  Well: «Some
> people, when confronted with a problem, think "I know, I'll use regular
> expressions." Now they have two problems.» :-) [1]

Exactly.  Full text search is a big step forward for usability I think.

> From my point of view, yes.  Somehow “filesearch” is a subpart of
> “search”.  So it should share the machinery.

I'll work on it.  I'll try to make the code flexible enough so that it
can be moved to another command easily, should we decide that "search"
is not the right fit.

> From my point of view, how to transfer the database from substitutes to
> users and how to update it locally (custom channels or custom load path)
> are not easy.  Maybe these are the core issues.

Absolutely.

> For example, I just did “guix pull” and “--list-generations” says from
> f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):
>
>    39.9 MB will be downloaded
>
> plus the tiny bits before “Computing Guix derivation”.  Say 50MB max.
>
> Well, the “locate” database for my “/gnu/store” (~30GB) is already
> ~50MB, and ~20MB when compressed with gzip.  And Pierre said:
>
>   The database with all package descriptions and synopses is 46 MiB
>   and compresses down to 11 MiB in zstd.

I should have benchmarked with Lzip, it would have been more useful.  I
think we can get it down to approximately 8 MiB in Lzip.

> which is better but still something.  Well, it is not affordable to
> fetch the database with “guix pull”, In My Humble Opinion.

We could send a "diff" of the database.

For instance, if the user already has a file database for the Guix
generation A, then guix pulls to B, the substitute server can send the
diff between A and B.  This would probably amount to less than 1 MiB if
the generations are not too far apart.  (Warning: actual measures needed!)
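The "diff" idea could be as simple as a set difference over (package, file) rows.  A minimal sketch, illustrative only (the tuple layout is invented here and is not a proposed wire format):

```python
def db_delta(old_rows, new_rows):
    """Rows to add and remove to turn generation A's table into B's."""
    old, new = set(old_rows), set(new_rows)
    return {"add": sorted(new - old), "remove": sorted(old - new)}


# Hypothetical file tables for two Guix generations.
a = {("hello", "bin/hello"), ("magit", "magit.el")}
b = a | {("magit", "magit-wip.el")}

db_delta(a, b)
# {'add': [('magit', 'magit-wip.el')], 'remove': []}
```

The client would apply the "add" and "remove" rows to its local table instead of downloading the whole database; whether the saving justifies the server-side complexity is exactly the open question.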

> Therefore, the database would be fetched at the first “guix search”
> (assuming the point above).  But then, how could “search” know what is
> a custom build and what is not?  Somehow, “search” would have to scan
> the whole store to be able to update the database.
>
> And what happens each time I do a custom build and then run
> “filesearch”?  The database should be updated, right?  Well, it seems
> almost unusable.

I mentioned this previously: we need to update the database on "guix
build".  This is very fast and would be mostly transparent to the user.
This is essentially how "guix size" behaves.

> The model “updatedb/locate” seems better.  The user updates “manually”
> if required, and then locating is fast.

"manually" is not good in my opinion.  The end-user will inevitably
forget.  An out-of-sync database would return bad results, which is a
big no-no for search.  On-demand database updates are ideal, I think.

> To me, each time I am using “filesearch”:
>
>  - first time: fetch the database corresponding to the Guix commit and
>  then update it with my local store

Possibly using a "diff" to shrink the download size.

>  - otherwise: use this database
>  - optionally update the database if the user wants to include new
>  custom items.

No need for the optional point I believe.

> We could imagine a hook or option to “guix pull” specifying to also
> fetch the database and update it at pull time instead of “search” time.
> Personally, I prefer longer “guix pull” because it is already a bit long
> and then fast “search” than half/half (not so long pull and longer
> search).

I suggest we do it at pull time so that =guix search= does not need
network access.  =guix pull= requires networking anyway.

>> - Find a way to garbage-collect the database(s).  My intuition is that
>>   we should have 1 database per Guix checkout and when we `guix gc` a
>>   Guix checkout we collect the corresponding database.
>
> Well, the exact same strategy as
> ~/.config/guix/current/lib/guix/package.cache can be used.

Oh!  I didn't know about this file!  What is it used for?

> BTW, thanks Pierre for improving the Guix discoverability. :-)

Thank you! :)

-- 
Pierre Neidhardt
https://ambrevar.xyz/




Re: File search progress: database review and question on triggers

2020-10-10 Thread zimoun
Hi,

On Mon, 05 Oct 2020 at 20:53, Pierre Neidhardt  wrote:

> - Textual database: slow and not lighter than SQLite.  Not worth it I believe.

Maybe I am out of scope but, re-reading *all* the discussion about
“filesearch”, is it possible to really do better than “locate”, as
Ricardo mentioned?

--8<---cut here---start->8---
echo 3 > /proc/sys/vm/drop_caches
time updatedb --output=/tmp/store.db --database-root=/gnu/store/

real0m19.903s
user0m1.549s
sys 0m4.500s

du -sh /gnu/store /tmp/store.db
30G /gnu/store
56M /tmp/store.db

guix gc -F XXG
echo 3 > /proc/sys/vm/drop_caches
time updatedb --output=/tmp/store.db --database-root=/gnu/store/

real0m10.105s
user0m0.865s
sys 0m2.020s

du -sh /gnu/store /tmp/store.db
28G /gnu/store
52M /tmp/store.db
--8<---cut here---end--->8---

And then “locate” supports regexps and it is fast enough.

--8<---cut here---start->8---
echo 3 > /proc/sys/vm/drop_caches
time locate -d /tmp/store.db --regex "emacs-ma[a-z0-9\.\-]+\/[^.]+.el$" | tail -n5
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-transient.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-utils.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-wip.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit-worktree.el
/gnu/store/zawdnn1hhf4a2nscgw7rydkws383dl4l-emacs-magit-2.90.1-6.7f486d4/share/emacs/site-lisp/magit.el

real0m3.601s
user0m3.528s
sys 0m0.061s
--8<---cut here---end--->8---

The only point is that regexp is always cumbersome for me.  Well: «Some
people, when confronted with a problem, think "I know, I'll use regular
expressions." Now they have two problems.» :-) [1]

[1] https://en.wikiquote.org/wiki/Jamie_Zawinski


> - Include synopsis and descriptions.  Maybe we should include all fields
>   that are searched by `guix search`.  This incurs a cost on the
>   database size but it would fix the `guix search` speed issue.  Size
>   increases by some 10 MiB.

From my point of view, yes.  Somehow “filesearch” is a subpart of
“search”.  So it should share the machinery.


> I say we go with SQLite full-text search for now with all package
> details.  Switching to without full-text search is just a matter of a
> minor adjustment, which we can decide later when merging the final
> patch.  Same if we decide not to include the description, synopsis, etc.

[...]

> - Populate the database on demand, either after a `guix build` or from a
>   `guix filesearch...`.  This is important so that `guix filesearch`
>   works on packages built locally.  If `guix build`, I need help to know
>   where to plug it in.

[...]

> - Sync the databases from the substitute server to the client when
>   running `guix filesearch`.  For this I suggest we send the compressed
>   database corresponding to a guix generation over the network (around
>   10 MiB).  Not sure sending just the delta is worth it.

From my point of view, how to transfer the database from substitutes to
users and how to update it locally (custom channels or custom load path)
are not easy.  Maybe these are the core issues.


For example, I just did “guix pull” and “--list-generations” says from
f6dfe42 (Sept. 15) to 4ec2190 (Oct. 10):

   39.9 MB will be downloaded

plus the tiny bits before “Computing Guix derivation”.  Say 50MB max.

Well, the “locate” database for my “/gnu/store” (~30GB) is already
~50MB, and ~20MB when compressed with gzip.  And Pierre said:

  The database with all package descriptions and synopses is 46 MiB
  and compresses down to 11 MiB in zstd.

which is better but still something.  Well, it is not affordable to
fetch the database with “guix pull”, IMHO.


Therefore, the database would be fetched at the first “guix search”
(assuming the point above).  But then, how could “search” know what is a
custom build and what is not?  Somehow, “search” would have to scan the
whole store to be able to update the database.

And what happens each time I do a custom build and then run “filesearch”?
The database should be updated, right?  Well, it seems almost unusable.

The model “updatedb/locate” seems better.  The user updates “manually”
if required, and then locating is fast.

Most of the time I am searching for files in packages that are not my
custom packages, IMHO.


To me, each time I am using “filesearch”:

 - first time: fetch the database corresponding to the Guix commit and
 then update it with my local store
 - otherwise: use this database
 - optionally update the database if the user wants to include new
 custom items.


We could imagine a hook or option to “guix pull” specifying to also
fetch the database and update it at pull time instead of “search” time.

Re: File search progress: database review and question on triggers

2020-10-10 Thread zimoun
Hi,

On Sat, 10 Oct 2020 at 10:57, Pierre Neidhardt  wrote:
> Of course, `guix filesearch` hasn't been implemented yet ;)

Sorry, I have overlooked the status. :-)


> We still need to decide whether we want to make it part of `guix search'
> or define a separate command.

From my point of view, it would be better to include it under “guix
search”.  Even the code for “search” in “guix package” should go in
“guix search”; that is, do the opposite of the current situation, with
“guix package --search” becoming an alias of “guix search” in terms of
code.  Anyway. :-)


Cheers,
simon



Re: File search progress: database review and question on triggers

2020-10-10 Thread Pierre Neidhardt
Of course, `guix filesearch` hasn't been implemented yet ;)

We still need to decide whether we want to make it part of `guix search'
or define a separate command.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/




Re: File search progress: database review and question on triggers

2020-10-09 Thread zimoun
Hi Pierre,

On Mon, 5 Oct 2020 at 20:53, Pierre Neidhardt  wrote:

> Comments and help welcome! :)

I have just checked it out and I am probably doing something wrong, but
"./pre-inst-env guix filesearch whatever" raises an error with the
backtrace:

--8<---cut here---start->8---
Backtrace:
In ice-9/boot-9.scm:
  1736:10  5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
   4 (apply-smob/0 #)
In ice-9/boot-9.scm:
718:2  3 (call-with-prompt _ _ #)
In ice-9/eval.scm:
619:8  2 (_ #(#(#)))
In guix/ui.scm:
  2038:22  1 (run-guix-command filesearch "hello")
In ice-9/boot-9.scm:
  2828:12  0 (module-ref _ _ . _)

ice-9/boot-9.scm:2828:12: In procedure module-ref:
No variable named guix-filesearch in #
--8<---cut here---end--->8---

I will comment later on some specific points.

BTW, thank you for revivifying the topic, especially the database. :-)

Cheers,
simon



Re: File search progress: database review and question on triggers

2020-10-05 Thread Pierre Neidhardt
Hi Ludo!

Ludovic Courtès  writes:

> Nice!

Thanks!

> Could you post a summary of what you have done, what’s left to do, and
> how you’d like to integrate it?  (If you’ve already done it, my
> apologies, but you can resend a link.  :-))

What I've done: mostly a database benchmark.

- Textual database: slow and not lighter than SQLite.  Not worth it I believe.

- SQLite without full-text search: fast, supports classic patterns
  (e.g. "foo*bar") but does not support word permutations.

- SQLite with full-text search: fast, supports word permutations but
  does not support suffix-matching (e.g. "bar" won't match "foobar").
  Size is about the same as without full-text search.

- Include synopsis and descriptions.  Maybe we should include all fields
  that are searched by `guix search`.  This incurs a cost on the
  database size but it would fix the `guix search` speed issue.  Size
  increases by some 10 MiB.

I say we go with SQLite full-text search for now with all package
details.  Switching to without full-text search is just a matter of a
minor adjustment, which we can decide later when merging the final
patch.  Same if we decide not to include the description, synopsis, etc.

What's left to do:

- Populate the database on demand, either after a `guix build` or from a
  `guix filesearch...`.  This is important so that `guix filesearch`
  works on packages built locally.  If `guix build`, I need help to know
  where to plug it in.

- Adapt Cuirass so that it builds its file database.
  I need pointers to get started here.

- Sync the databases from the substitute server to the client when
  running `guix filesearch`.  For this I suggest we send the compressed
  database corresponding to a guix generation over the network (around
  10 MiB).  Not sure sending just the delta is worth it.

- Find a way to garbage-collect the database(s).  My intuition is that
  we should have 1 database per Guix checkout and when we `guix gc` a
  Guix checkout we collect the corresponding database.

  I would store the databases in /var/guix/...

Comments and help welcome! :)

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-10-05 Thread Ludovic Courtès
Hi,

Pierre Neidhardt  skribis:

> I've pushed a commit which adds the synopsis and the description to the
> database.
>
> 0127cfa5d089857a716bf7b0a167f31cc6dd
>
> The quite surprising result is that these new details only cost 1 MiB extra!
>
> I can then search packages and it's super fast:
>
>> ,time (search-package "org" "equival") 
> $51 = (("emacs-org-contrib"))
> ;; 0.002930s real time, 0.002925s run time.  0.00s spent in GC.
>
> My local database only includes the store items, that is to say some
> 1800 items, but even with the 15000 packages the search would not be
> impacted by much.  The size might grow by some 10 MiB though, but that's
> reasonable and it compresses very well (down to 6-7 MiB in zstd).

Nice!

Could you post a summary of what you have done, what’s left to do, and
how you’d like to integrate it?  (If you’ve already done it, my
apologies, but you can resend a link.  :-))

Thank you,
Ludo’.



Re: File search progress: database review and question on triggers

2020-09-06 Thread Arun Isaac

> There is a subtle difference: in the latter, (search-file-package) is
> allowed and won't trigger a compile time error, while the former does.
> This "foo . more-foo" paradigm is a way to say "1 or more arguments",
> instead of "0 or more".

Ah, ok, that makes sense!

Regards,
Arun


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-09-06 Thread Arun Isaac

> Oops!  Forgot to push.
>
> It's actually commit 25147f983bdf432b03e8271abe0318f4812f94ba on
> wip-filesearch.

I checked and it works now. Just a tiny nitpick:

The function signature of search-file-package,

--8<---cut here---start->8---
(define (search-file-package pattern . more-patterns) ...)
--8<---cut here---end--->8---

can be rewritten as

--8<---cut here---start->8---
(define (search-file-package . patterns) ...)
--8<---cut here---end--->8---

No need to cons pattern onto more-patterns in the function body.

Thanks for working on this!


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-09-04 Thread Arun Isaac


Hi Pierre!

Sorry for the very late response.

> I've fixed it in
> e08f913d20428a9a925cc46d177c7446f55e6443.  The downside is that we can't
> use any special character like boolean operators.  I'm not sure how we
> can get the best of both worlds.  Maybe add a command line flag that
> would enable "extended search syntax" in which the user can use all of
> FTS but must make sure to quote "libm.so" for instance.

Where is this commit so I may test it? I couldn't find the commit in the
wip-filesearch branch. The HEAD of the wip-filesearch branch I see is
49b52c2c7be03caf3636632c31f4451d5bc88125.

Regards,
Arun



Re: File search progress: database review and question on triggers

2020-08-27 Thread Pierre Neidhardt
zimoun  writes:

> I am not sure to see how.  One needs all the database to search inside
> and cannot know in advance which packages etc..  Contrary to “guix size”
> which manipulates the graph and then download the missing parts.
>
> Therefore, your suggestion is to download all the database the first
> time the user run “guix filesearch”, i.e., download ~10MiB.

Yes.
If the user is not using substitutes, they can also compute the database
from their local store items (like I'm doing in the graft).  It's less
useful but still a bit more convenient than running find/locate on
/gnu/store.

> Then each time the user runs “guix pull” then “guix filesearch”, two
> options, either download the new database for this last Guix generation,
> or either download a diff (not sure the complexity is worth for ~10MiB).
>
> Right?

Yes.

> And what about the channels?  Because if I read correctly, “guix size”
> fails when “no available substitute information“.

Just like `guix size', it would work with local items.  But if there is
no substitute server for channels, then there is no way around it.

>> I think this is a bit beyond the scope of this patch set.  I'd rather
>> focus on files exclusively for now and proceed one step at a time :)
>
> I do not think it is beyond the scope because Arun introduced an SQLite
> database for improving “guix search”.  But this path had been stopped
> because of “introducing complexity” [1].  Therefore, if “guix
> filesearch” introduces a SQL cache, then it seems a good idea to be also
> usable by “guix search”.
>
> Well, if the table is extended with the fields “synopsis” and
> “description” then what is the size of the database?  Does it kill the
> lookup performance?

Good point, I'll test and report my measures.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-27 Thread zimoun
Hi Pierre,

On Thu, 27 Aug 2020 at 13:15, Pierre Neidhardt  wrote:

> zimoun  writes:
>
>> If you are going to an local SQL database, my two questions are:
>>
>> a)
>> Which part would update it?  “guix pull”?  Other?  Even using
>> substitutes, the channels and co could lead to an extra cost and so what
>> is acceptable and what is not?
>
> I suggest fetching database updates when performing the filesearch, just
> like `guix size` does.

I am not sure to see how.  One needs all the database to search inside
and cannot know in advance which packages etc..  Contrary to “guix size”
which manipulates the graph and then download the missing parts.

Therefore, your suggestion is to download all the database the first
time the user run “guix filesearch”, i.e., download ~10MiB.

Then each time the user runs “guix pull” then “guix filesearch”, two
options, either download the new database for this last Guix generation,
or either download a diff (not sure the complexity is worth for ~10MiB).

Right?


And what about the channels?  Because if I read correctly, “guix size”
fails when “no available substitute information“.


>> b)
>> Could you also include other fields such that “synopsis” and
>> “description”?  Because it could speed up “guix search” without adding
>> (or modifying) the current cache
>> (~/config/guix/current/lib/package.cache); discussed at length in
>> #39258 .
>
> I think this is a bit beyond the scope of this patch set.  I'd rather
> focus on files exclusively for now and proceed one step at a time :)

I do not think it is beyond the scope because Arun introduced an SQLite
database for improving “guix search”.  But this path had been stopped
because of “introducing complexity” [1].  Therefore, if “guix
filesearch” introduces a SQL cache, then it seems a good idea to be also
usable by “guix search”.

Well, if the table is extended with the fields “synopsis” and
“description” then what is the size of the database?  Does it kill the
lookup performance?

[1] http://issues.guix.gnu.org/issue/39258#7


Cheers,
simon



Re: File search progress: database review and question on triggers

2020-08-27 Thread Pierre Neidhardt
Hi Simon!

zimoun  writes:

> If you are going to an local SQL database, my two questions are:
>
> a)
> Which part would update it?  “guix pull”?  Other?  Even using
> substitutes, the channels and co could lead to an extra cost and so what
> is acceptable and what is not?

I suggest fetching database updates when performing the filesearch, just
like `guix size` does.

The induced cost would be little in my opinion.  A compressed database
for a whole generation is below 10 MiB.  This operation would need to be
done only once per Guix generation.  If we are a bit smart with timestamping
and only send a diff, it could even be much, much smaller.

> b)
> Could you also include other fields such that “synopsis” and
> “description”?  Because it could speed up “guix search” without adding
> (or modifying) the current cache
> (~/config/guix/current/lib/package.cache); discussed at length in
> #39258 .

I think this is a bit beyond the scope of this patch set.  I'd rather
focus on files exclusively for now and proceed one step at a time :)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-27 Thread zimoun
Hi Pierre,

Cool to relive the topic. :-)

(disclaim: I have not yet process all the thread and emails)


On Mon, 10 Aug 2020 at 16:32, Pierre Neidhardt  wrote:

> 1. An SQLite database with the following schema:

If you are going to an local SQL database, my two questions are:

a)
Which part would update it?  “guix pull”?  Other?  Even using
substitutes, the channels and co could lead to an extra cost and so what
is acceptable and what is not?

b)
Could you also include other fields such that “synopsis” and
“description”?  Because it could speed up “guix search” without adding
(or modifying) the current cache
(~/config/guix/current/lib/package.cache); discussed at length in
#39258 .



Cheers,
simon




Re: File search progress: database review and question on triggers

2020-08-24 Thread Pierre Neidhardt
>> - You should use SQL prepared statements with sqlite-prepare,
>>   sqlite-bind, etc. That would correctly handle escaping special
>>   characters in the search string. Currently, searching for
>>   "transmission-gtk", "libm.so", etc. errors out.
>
> Thanks for pointing this out, I'll look into it.


I've looked into it.
Actually the issue is not with the lack of sqlite-bind-arguments, but
with "-" and "." triggering an error in FTS.

I've fixed it in
e08f913d20428a9a925cc46d177c7446f55e6443.  The downside is that we can't
use any special character like boolean operators.  I'm not sure how we
can get the best of both worlds.  Maybe add a command line flag that
would enable "extended search syntax" in which the user can use all of
FTS but must make sure to quote "libm.so" for instance.

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers OFF TOPIC PRAISE

2020-08-18 Thread Joshua Branson


Thanks for working on this!  This is a super awesome feature!  Best of luck!

-- 
Joshua Branson
Sent from Emacs and Gnus



Re: File search progress: database review and question on triggers

2020-08-16 Thread Hartmut Goebel
Am 15.08.20 um 23:20 schrieb Bengt Richter:
> If you are on debian, have you tried
> dpkg -l '*your*globbed*name*here*'

No, since for Debian I'm not using a command-line tool, but the
Web-Interface - which allows querying even packages I have not
installed. (And the later is my specific use-case: Find which package I
need to install to get a specific file.)

-- 
Regards
Hartmut Goebel

| Hartmut Goebel  | h.goe...@crazy-compilers.com   |
| www.crazy-compilers.com | compilers which you thought are impossible |




Re: File search progress: database review and question on triggers

2020-08-15 Thread Bengt Richter
Hi Hartmut, et al

On +2020-08-15 14:47:12 +0200, Hartmut Goebel wrote:
> Am 13.08.20 um 12:04 schrieb Pierre Neidhardt:
> > SQLite pattern search queries are extremely fast (<0.1s) and cover all
> > examples named so far:
> >
> > - exact basename match
> > - partial path match
> > - pattern match (e.g. "/include/%foo%")
> 
> 
> For comparison: These are the options Debian Package search
>  supports:
> 
>   * paths ending with the keyword  -> e.g. file-extensions
>   * packages that contain files named like this -> basename
> >   * packages that contain files whose names contain the keyword  ->
> > LIKE %keyword%
>

If you are on debian, have you tried
dpkg -l '*your*globbed*name*here*'
or
dpkg -l '*'|egrep your-regex-here
or the former, preserving the header and excluding non-installed packages, e.g.:
dpkg -l '*your*globbed*name*here*'|grep -v ^un

> -- 
> Regards
> Hartmut Goebel
> 
> | Hartmut Goebel  | h.goe...@crazy-compilers.com   |
> | www.crazy-compilers.com | compilers which you thought are impossible |
> 
> 
-- 
Regards,
Bengt Richter



Re: File search progress: database review and question on triggers

2020-08-15 Thread Arun Isaac

Hi Pierre,

I tried the wip-filesearch branch. Nice work! :-)
persist-all-local-packages takes around 350 seconds on my machine (slow
machine with spinning disk) and the database is 50 MB. Some other
comments follow.

- Maybe, we shouldn't index hidden files, particularly all the .xxx-real
  files created by our wrap phases.

- You should use SQL prepared statements with sqlite-prepare,
  sqlite-bind, etc. That would correctly handle escaping special
  characters in the search string. Currently, searching for
  "transmission-gtk", "libm.so", etc. errors out.

- Searching for "git perl5" works as expected, but searching for "git
  perl" returns no results. I think this is due to the tokenizer used by
  the full text search indexer. The tokenizer sees the word "perl5" as
  one indivisible token and does not realize that "perl" is a prefix of
  "perl5". Unfortunately, I think this is a fundamental problem with FTS
  -- one that can only be fixed by using simple LIKE patterns. FTS is
  meant for natural language search where this kind of thing would be
  normal.

- I guess you are only indexing local packages now, but will include all
  packages later by some means.
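The point about prepared statements (sqlite-prepare/sqlite-bind in
guile-sqlite3) is that the search string should be bound as a parameter
rather than spliced into the SQL text, so characters such as "-" or "."
in "libm.so" cannot break the statement.  A minimal sketch, in Python
for brevity:

```python
import sqlite3

# Illustrative table, not the actual Guix schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (package TEXT, path TEXT)")
db.execute("INSERT INTO files VALUES ('glibc', '/lib/libm.so')")

query = "libm.so"
# Unsafe: "... WHERE path LIKE '%" + query + "%'" (string splicing).
# Safe: bind the pattern as a parameter; SQLite never parses it as SQL.
cur = db.execute("SELECT package FROM files WHERE path LIKE ?",
                 ("%" + query + "%",))
rows = cur.fetchall()
```

As Pierre later found, the errors here actually came from FTS query
syntax rather than SQL escaping, but binding remains the right habit.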

Cheers!


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-15 Thread Hartmut Goebel
Am 11.08.20 um 14:35 schrieb Pierre Neidhardt:
> Unlike Nix, we would like to do more than just index executable files.
> Indeed, it's very useful to know where to find, say, a C header, a .so
> library, a TeXlive .sty file, etc.

+1

Most of the time I'm searching for non-executable files.

-- 
Regards
Hartmut Goebel

| Hartmut Goebel  | h.goe...@crazy-compilers.com   |
| www.crazy-compilers.com | compilers which you thought are impossible |




Re: File search progress: database review and question on triggers

2020-08-15 Thread Hartmut Goebel
Am 13.08.20 um 12:04 schrieb Pierre Neidhardt:
> SQLite pattern search queries are extremely fast (<0.1s) and cover all
> examples named so far:
>
> - exact basename match
> - partial path match
> - pattern match (e.g. "/include/%foo%")


For comparison: These are the options Debian Package search
 supports:

  * paths ending with the keyword  -> e.g. file-extensions
  * packages that contain files named like this -> basename
  * packages that contain files whose names contain the keyword  ->
LIKE %keyword%

-- 
Regards
Hartmut Goebel

| Hartmut Goebel  | h.goe...@crazy-compilers.com   |
| www.crazy-compilers.com | compilers which you thought are impossible |




Re: File search progress: database review and question on triggers

2020-08-13 Thread Pierre Neidhardt
I've pushed my experiment to the `wip-filesearch' branch.

As of this writing it is not automatically triggered by "guix build".
To test it:

- Load the module from a REPL.

- Run

--8<---cut here---start->8---
  (test-index-git)
--8<---cut here---end--->8---

  You should get a database at ~/.cache/guix/files.db.

- Run

--8<---cut here---start->8---
  (format-search (search-file-package "git perl5"))
--8<---cut here---end--->8---

  to get some search results.

- Run

--8<---cut here---start->8---
  (persist-all-local-packages)
--8<---cut here---end--->8---

  to build the database for all your local files.  Warning, this takes
  some 30 seconds on my machine, and could take minutes if you have a
  spinning drive or an old processor.

I'll be away for a couple of days, feel free to experiment and send me
some feedback in the meantime.

What remains to be done:

- Automatically index "guix build" result in database.
  Any idea where I can trigger the indexing from a "guix build"?

- Sync the database data from a substitute server.  Any pointer would be
  appreciated :)

Cheers!

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-13 Thread Pierre Neidhardt
Arun Isaac  writes:

> But filenames usually don't have diacritics. So, I'm not sure if
> diacritic insensitivity is useful.

Probably not, but if there ever is this odd file name with an accent,
then we won't have to worry about it, it will be handled.  Better too
much than too little!

> This is handled by stemming. We'll need a custom stemmer that normalizes
> libfoo to foo. Xapian has a nice page on stemming. See
> https://xapian.org/docs/stemming.html

I think I gave a confusing example.  I think stemming is too specific.
I meant searching a pattern that does not start the word.
If I search "bar", I'd like to match "foobar".

The reverse is possible ("foo" matches "foobar").

Well, I'll share my FTS demo just now, we can always think about this case 
later.

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-13 Thread Arun Isaac

> Yes, but full text search brings us a few niceties here:

These are nice features, but I don't know if all of them are useful for
file search. Normally, with Arch's pkgfile, I seach for some missing
header file, shared library, etc. Usually, I know the exact filename I
am looking for, or I know some prefix or suffix of the exact filename.

> - Case insensitive, diacritic insensitive (e.g. "e" matches "É").

Case insensitivity is quite useful. Most filenames are in lower case,
but there is always that one odd filename out there.

But filenames usually don't have diacritics. So, I'm not sure if
diacritic insensitivity is useful.

> All the above is arguably more powerful and easier to use than regexp.
> But even if no user ever bothers with the logic operators, the default
> behaviour "just works" in the fashion of a search engine.
>
> The main thing I don't know how to do is suffix matches (like "%foo").
> With FTS, looking up "foo*" won't match "libfoo", which is problematic.
> Any clue how to fix that?

This is handled by stemming. We'll need a custom stemmer that normalizes
libfoo to foo. Xapian has a nice page on stemming. See
https://xapian.org/docs/stemming.html
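A custom "stemmer" for file names would not be a natural-language
stemmer at all; a toy normalization pass of the kind suggested here
(sketched in Python, with invented rules) might strip the "lib" prefix
and version/extension noise before indexing, so "foo" can match
"libfoo.so.1":

```python
import re

# Hypothetical token normalizer: the rules below are assumptions for
# illustration, not anything implemented in the wip-filesearch branch.
def normalize_token(token):
    token = token.lower()
    # Drop shared-library/archive extensions and trailing versions.
    token = re.sub(r"\.(so|a|la)(\.\d+)*$", "", token)
    if token.startswith("lib") and len(token) > 3:
        token = token[3:]
    return token

print(normalize_token("libfoo.so.1"))  # -> "foo"
```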

Cheers!


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-13 Thread Pierre Neidhardt
Ricardo Wurmus  writes:

> Pierre Neidhardt  writes:
>
>> - Or do you think SQLite patterns (using "%") would do for now?  As
>>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>>   Guix.  But maybe regexp support can be added in a second stage.
>
> These patterns could be generated from user input that looks more like
> what we already use elsewhere.  We don’t have to directly pass the
> SQLite interface to users.

True, an obvious fix to the problem of syntax familiarity would be to
replace "*" by "%" in the user query, to have it look like shell globing.

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-13 Thread Arun Isaac

>> sqlite insert statements can be very fast. sqlite.org claims 5
>> or...
>
> I tried it, and got it down to... 30s!

That's great! :-)

> - Or do you think SQLite patterns (using "%") would do for now?  As
>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>   Guix.  But maybe regexp support can be added in a second stage.

The inconsistency is unfortunate. Personally, I am in favor of dropping
regexp support everywhere in Guix, and only having literal string
search. But that is backward incompatible, and may be controversial.

In this specific case of file search, we could use the sqlite like
patterns, but not expose them to the user. For example, if the search
query is "foo", we search for the LIKE pattern "%foo%". I think this
addresses how users normally search for files. I don't think regexps
add much value.

Full text search may not be relevant to file search. Full text search is
more suited for natural language search involving such things as
stemming algorithms.
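The "%query%" idea needs one refinement if the user's literal text may
itself contain "%" or "_": escape those characters and tell SQLite the
escape character.  A hedged sketch in Python (helper name invented):

```python
import sqlite3

# Wrap a literal query in LIKE wildcards, escaping any wildcard
# characters already present in the query itself.
def contains_pattern(query, esc="\\"):
    for ch in (esc, "%", "_"):
        query = query.replace(ch, esc + ch)
    return "%" + query + "%"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT)")
db.execute("INSERT INTO files VALUES ('/include/foo/bar.h')")
cur = db.execute("SELECT path FROM files WHERE path LIKE ? ESCAPE '\\'",
                 (contains_pattern("foo/bar"),))
rows = cur.fetchall()
```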


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-13 Thread Ricardo Wurmus


Pierre Neidhardt  writes:

> - Or do you think SQLite patterns (using "%") would do for now?  As
>   Mathieu pointed out, it's an unfortunate inconsistency with the rest of
>   Guix.  But maybe regexp support can be added in a second stage.

These patterns could be generated from user input that looks more like
what we already use elsewhere.  We don’t have to directly pass the
SQLite interface to users.

-- 
Ricardo



Re: File search progress: database review and question on triggers

2020-08-13 Thread Ricardo Wurmus


Pierre Neidhardt  writes:

> Julien Lepiller  writes:
>
>> Why wouldn't it help? Can't you make it a trie from basename ->
>> complete name? If I'm looking for "libcord.so" (which is a key in the
>> trie), I don't think I need to look for every path. I only need to
>> follow the trie until I find a pointer to some structure that contains
>> the data I look for (ex: a list of complete filenames).
>
> Fair enough, but it's a more limited scope: here we assume the user
> knows the exact basename.
>
> It's a bit too limited in my opinion:
>
> - It's only too common to have shared objects ending with a .X.Y extension
>   (where X.Y is a version), the version-less file is not always present
>   which means a lot of trial and error on the user end just to search
>   the right file.

We can always abort the radix trie traversal early and print all leaves
that can be reached from the node that we arrived at.  This allows us to
search for “libc” and find “libcord.so.1.2”, “libc.so”, “libcaca.a”,
etc.

> - It does not cover the case where I don't know the basename, e.g. if I'm
>   looking for a FOO header file my query would look like "/include/.*foo.*".

I think this is a rather rare use case, which in my opinion doesn’t
justify forgoing the use of a smart data structure.  It would *still* be
possible to use the prefix tree for this kind of search, but it would
certainly not be optimal. (E.g. by searching up to the first wildcard,
and then searching each resulting sub-tree up to the first non-wildcard,
and resuming the search in all remaining sub-trees.)

> I believe it's important that the search be as general as possible.

Search should be cheap and fast.  Dependent on how we arrange the data,
users can easily filter the results with grep, if necessary.  For
example, we could have a radix tree of the base names and have the
directory prefix as the leaf nodes.  The search would then involve two
steps: culling the set of directory prefixes by following the base name
trie (that’s fast), then do a simple linear search over the results
(that’s slow).

The trie can easily be serialized as a bunch of offsets that we only
need to repeatedly seek to, so we never need to hold it in memory.
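The two-step search Ricardo describes (walk the base-name trie, then
collect every directory prefix below the node reached) can be sketched
with a plain in-memory trie; this toy Python version ignores the
serialization/seeking aspect entirely:

```python
# Toy trie over base names; leaves (keyed by None) carry the
# directory prefixes where that base name occurs.
def make_trie(entries):
    root = {}
    for name, prefix in entries:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(prefix)
    return root

def prefix_search(trie, prefix):
    # Step 1: follow the trie edge by edge (fast culling).
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Step 2: abort the descent here and emit all reachable leaves,
    # so "libc" finds "libc.so" as well as "libcord.so.1.2".
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, value in n.items():
            if key is None:
                out.extend(value)
            else:
                stack.append(value)
    return out

trie = make_trie([("libc.so", "/gnu/store/...-glibc/lib"),
                  ("libcord.so.1.2", "/gnu/store/...-gc/lib"),
                  ("hello", "/gnu/store/...-hello/bin")])
hits = prefix_search(trie, "libc")
```

(The `/gnu/store/...` prefixes are truncated placeholders, as in the
thread.)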

-- 
Ricardo



Re: File search progress: database review and question on triggers

2020-08-13 Thread Pierre Neidhardt
Hi Ricardo,

Ricardo Wurmus  writes:

>> Why wouldn't it help? Can't you make it a trie from basename -> complete 
>> name? If I'm looking for "libcord.so" (which is a key in the trie), I don't 
>> think I need to look for every path. I only need to follow the trie until I 
>> find a pointer to some structure that contains the data I look for (ex: a 
>> list of complete filenames).
>
> Exactly!  It’s somewhat less useful (without pre-processing) for base
> names and more useful for absolute file names, where you can quickly
> toss out mismatches while traversing the trie.
>
> You could also optimize the data structure by swapping the base name and
> the directory name, assuming that users would want to search for either
> absolute file names and also for base names (and not for a directory in
> between).

By absolute file name, you mean relative to the Guix store item?  As I
mentioned in my reply to Julien, this would be a specific use-case
optimization; as you said, it won't work when you don't know the exact
basename or full path, so I'm not sure it's worth going down that road.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-12 Thread Arun Isaac

Hi,

> 1. I tried to fine-tune the SQL a bit:
>   - Open/close the database only once for the whole indexing.
>   - Use "insert" instead of "insert or replace".
>   - Use numeric ID as key instead of path.
>
>   Result: Still around 15-20 minutes to build.  Switching to numeric
>   indices shrank the database by half.

sqlite insert statements can be very fast. sqlite.org claims 50,000 or
more INSERT statements per second. But in order to achieve that speed
all insert statements have to be grouped together in a single
transaction. See https://www.sqlite.org/faq.html#q19
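The effect of batching is easy to demonstrate; in Python's sqlite3 the
connection's context manager wraps the whole batch in one transaction
(an in-memory database is used here, so the per-commit fsync cost that
dominates on disk is not visible, but the structure is the same):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT)")

# Illustrative paths only; "..." stands in for a real store hash.
paths = [("/gnu/store/...-pkg/bin/tool%d" % i,) for i in range(10000)]

# One transaction for the whole batch instead of one commit per row.
with db:
    db.executemany("INSERT INTO files (path) VALUES (?)", paths)

count = db.execute("SELECT count(*) FROM files").fetchone()[0]
```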

>   A string-contains filter takes less than 1 second.

Guile's string-contains function uses a naive O(nk) implementation,
where 'n' is the length of string s1 and 'k' is the length of string
s2. If it was implemented using the Knuth-Morris-Pratt algorithm, it
could cost only O(n+k). So, there is some scope for improvement here. In
fact, a comment on line 2007 of libguile/srfi-13.c in the guile source
tree makes this very point.
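For reference, the Knuth-Morris-Pratt idea is: precompute, for each
position in the needle, the longest proper prefix that is also a
suffix, so the scan over the haystack never backs up.  A compact
Python sketch:

```python
# KMP substring search: O(n + k) instead of the naive O(n * k).
def kmp_contains(haystack, needle):
    if not needle:
        return True
    # Failure table: fail[i] = length of the longest proper prefix of
    # needle[:i+1] that is also a suffix of it.
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    # Scan the haystack, reusing the table on mismatches.
    k = 0
    for ch in haystack:
        while k and ch != needle[k]:
            k = fail[k - 1]
        if ch == needle[k]:
            k += 1
            if k == len(needle):
                return True
    return False
```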

> I need to measure the time SQL takes for a regexp match.

sqlite, by default, does not come with regexp support. You might have to
load some external library. See
https://www.sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators

--8<---cut here---start->8---
The REGEXP operator is a special syntax for the regexp() user
function. No regexp() user function is defined by default and so use of
the REGEXP operator will normally result in an error message. If an
application-defined SQL function named "regexp" is added at run-time,
then the "X REGEXP Y" operator will be implemented as a call to
"regexp(Y,X)".
--8<---cut here---end--->8---
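Registering such a user function is what the host application (here a
Guile binding; Python's `create_function` is the direct counterpart)
would have to do.  Note the argument order: per the quoted passage,
"X REGEXP Y" calls regexp(Y, X), i.e. pattern first:

```python
import re
import sqlite3

db = sqlite3.connect(":memory:")
# Provide the missing regexp() function; SQLite then accepts REGEXP.
db.create_function("regexp", 2,
                   lambda pattern, text: re.search(pattern, text or "")
                   is not None)

db.execute("CREATE TABLE files (path TEXT)")
db.execute("INSERT INTO files VALUES ('/include/openssl/foo.h')")
cur = db.execute("SELECT path FROM files WHERE path REGEXP ?",
                 ("/include/.*foo",))
rows = cur.fetchall()
```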

Regards,
Arun


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-12 Thread Ricardo Wurmus


Julien Lepiller  writes:

> Why wouldn't it help? Can't you make it a trie from basename -> complete 
> name? If I'm looking for "libcord.so" (which is a key in the trie), I don't 
> think I need to look for every path. I only need to follow the trie until I 
> find a pointer to some structure that contains the data I look for (ex: a 
> list of complete filenames).

Exactly!  It’s somewhat less useful (without pre-processing) for base
names and more useful for absolute file names, where you can quickly
toss out mismatches while traversing the trie.

You could also optimize the data structure by swapping the base name and
the directory name, assuming that users would want to search for either
absolute file names and also for base names (and not for a directory in
between).

-- 
Ricardo



Re: File search progress: database review and question on triggers

2020-08-12 Thread Julien Lepiller
Why wouldn't it help? Can't you make it a trie from basename -> complete name? 
If I'm looking for "libcord.so" (which is a key in the trie), I don't think I 
need to look for every path. I only need to follow the trie until I find a 
pointer to some structure that contains the data I look for (ex: a list of 
complete filenames).

On August 12, 2020 16:43:37 GMT-04:00, Pierre Neidhardt  wrote:
>Julien Lepiller  writes:
>
>> Have you tried something more structured? I have some code for
>creating a binary search tree and even compressing/decompressing
>strings with huffman, as well as code to serialize all that (my
>deserialization is in Java though, so not very useful to you):
>https://framagit.org/nani-project/nani-website
>>
>> See modules/nani/trie.scm for instance.
>>
>> The idea is to have a binary search tree whose keys are filenames and
>value is a pointer in the file to a structure that holds data for this
>filerame (packages and versions of guix for instance). It's super fast
>to look it up, because you don't load the whole file, you only seek to
>the right position. It's a lot longer to build, and not so easy to
>update though.
>
>Assuming we'd have only 1 Guix generation per file, I'm not sure a trie
>would help because we _always_ need to search _all_ file paths, so in
>all cases we've got some 10,000+ entries to load in memory and loop
>over
>them.
>
>The total number of entries is the bottleneck here, both for the
>database load and the individual search queries.
>
>An obvious optimization is to memoize the database load.  This has a
>significant memory cost though.
>
>The trie could be helpful for multiple Guix generations in the same
>database, but I'm not sure it warrant the increased complexity.
>
>Thoughts?
>
>-- 
>Pierre Neidhardt
>https://ambrevar.xyz/


Re: File search progress: database review and question on triggers

2020-08-12 Thread Pierre Neidhardt
Pierre Neidhardt  writes:

>   Result: Takes between 20 and 2 minutes to complete and the result is
>   32 MiB big.  (I don't know why the timing varies.)

Typo: 20 _seconds_ to 2 minutes!  So it's faster than SQL by 1 or 2 orders of 
magnitude.

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-12 Thread Julien Lepiller
Have you tried something more structured? I have some code for creating a 
binary search tree and even compressing/decompressing strings with huffman, as 
well as code to serialize all that (my deserialization is in Java though, so 
not very useful to you): https://framagit.org/nani-project/nani-website

See modules/nani/trie.scm for instance.

The idea is to have a binary search tree whose keys are filenames and value is 
a pointer in the file to a structure that holds data for this filename 
(packages and versions of guix for instance). It's super fast to look it up, 
because you don't load the whole file, you only seek to the right position. 
It's a lot longer to build, and not so easy to update though.

On August 12, 2020 15:10:08 GMT-04:00, Pierre Neidhardt  wrote:
>I've done some benchmarking.
>
>1. I tried to fine-tune the SQL a bit:
>  - Open/close the database only once for the whole indexing.
>  - Use "insert" instead of "insert or replace".
>  - Use numeric ID as key instead of path.
>
>  Result: Still around 15-20 minutes to build.  Switching to numeric
>  indices shrank the database by half.
>
>2. I've tried with the following naive 1-file-per-line format:
>
>--8<---cut here---start->8---
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man5/acl.5.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_add_perm.3.gz"
>"/gnu/store/97p5gvb7jglmn9jpgwwf5al1798wi61f-acl-2.2.53//share/man/man3/acl_calc_mask.3.gz"
>...
>--8<---cut here---end--->8---
>
>  Result: Takes between 20 and 2 minutes to complete and the result is
>  32 MiB big.  (I don't know why the timing varies.)
>
>  A string-contains filter takes less than 1 second.
>
>  A string-match (regex) search takes some 3 seconds (Ryzen 5 @ 3.5
>  GHz).  I'm not sure if we can go faster.  I need to measure the time
>  SQL takes for a regexp match.
>
>Question: Any idea how to load the database as fast as possible?  I
>tried the following, it takes 1.5s on my machine:
>
>--8<---cut here---start->8---
>(define (load-textual-database)
>  (call-with-input-file %textual-db
>    (lambda (port)
>      (let loop ((line (get-line port))
>                 (result '()))
>        (if (string? line)
>            (loop (get-line port) (cons line result))
>            result)))))
>--8<---cut here---end--->8---
>
>Cheers!
>
>--
>Pierre Neidhardt
>https://ambrevar.xyz/


Re: File search progress: database review and question on triggers

2020-08-11 Thread Ricardo Wurmus


Pierre Neidhardt  writes:

> Pierre Neidhardt  writes:
>
>> Ricardo Wurmus  writes:
>>
>>> I’m not suggesting to use updatedb, but I think it can be instructive to
>>> look at how the file database is implemented there.  We don’t have to
>>> use SQlite if it is much slower and heavier than a custom inverted
>>> index.
>>
>> Good call, I'll benchmark against an inverted index.
>>
>> Some cost may also be induced by the Guix store queries, not sure if we
>> can optimize these.
>
> With an s-exp based file, or a trivial text-based format, the downside
> is that it needs a bit of extra work to only load select entries,
> e.g. just the entries matching a specific Guix version.
>
> Would you happen to know a serialization library that allows for loading
> only a select portion of a file?

I don’t know of any suitable file format, but a generated offset index
at the beginning of the file could be useful.  You’d read the first
expression and then seek to the specified byte offset (after the
position of the index expression) where you then read the target
expression.  This can easily be generated and it can be extended without
having to rewrite the whole file.

But perhaps that’s premature optimization.
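(A toy rendering of that layout, with Python and JSON standing in for Guile and s-expressions; the keys and paths are invented for illustration:)

```python
import io
import json

records = {"guix-1.1.0": ["/bin/guile", "/bin/guix"],
           "guix-1.2.0": ["/bin/guile", "/bin/guix", "/bin/guix-daemon"]}
bodies = {key: json.dumps(value).encode() for key, value in records.items()}

# First expression of the file: an index mapping each key to a byte
# offset counted from the end of the index line itself.
index = {}
offset = 0
for key, body in bodies.items():
    index[key] = offset
    offset += len(body) + 1          # +1 for the newline

index_line = json.dumps(index).encode() + b"\n"

buf = io.BytesIO()                   # stands in for the database file
buf.write(index_line)
for body in bodies.values():
    buf.write(body + b"\n")

# Reading: load only the index, then seek straight to one record
# without parsing anything else.
buf.seek(0)
idx = json.loads(buf.readline())
buf.seek(len(index_line) + idx["guix-1.2.0"])
entry = json.loads(buf.readline())
print(entry)  # ['/bin/guile', '/bin/guix', '/bin/guix-daemon']
```

Appending a new record only requires rewriting the index line, not the existing record bodies.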

-- 
Ricardo



Re: File search progress: database review and question on triggers

2020-08-11 Thread Pierre Neidhardt
Ricardo Wurmus  writes:

> Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
> 2.50GHz with SSD.
>
> I’m not suggesting to use updatedb, but I think it can be instructive to
> look at how the file database is implemented there.  We don’t have to
> use SQlite if it is much slower and heavier than a custom inverted
> index.

Good call, I'll benchmark against an inverted index.

Some cost may also be induced by the Guix store queries, not sure if we
can optimize these.

--
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-11 Thread Ricardo Wurmus


Pierre Neidhardt  writes:

> 3. Size of the database:
>I've persisted all locally-present store items for my current Guix version
>and it produced a database of 72 MiB.  It compresses down to 8 MiB
>in zstd.

For comparison, my laptop’s store contains 1,103,543 files (691,994
excluding .links).  The updatedb database for all of them is 86MB and
takes ~6 seconds to generate:

time updatedb \
 --localpaths=/gnu/store \
 --findoptions='( -path /gnu/store/.links -o -name *.drv -o -name *.chroot ) -prune -o -type f -print' \
 --output=/tmp/dbfile

locate -d /tmp/dbfile ecxc0800

(This could be further tweaked to exclude links…)

>The worst case is around (number of Guix generations) x ~100 MiB.

This seems a little excessive.

> 4. Indexing speed:
>The above items took some 20 minutes to complete (on my rather
>powerful machine).

Oof.  The updatedb hack above takes 6 seconds on my i7-6500U CPU @
2.50GHz with SSD.

I’m not suggesting to use updatedb, but I think it can be instructive to
look at how the file database is implemented there.  We don’t have to
use SQlite if it is much slower and heavier than a custom inverted
index.

-- 
Ricardo



Re: File search progress: database review and question on triggers

2020-08-11 Thread Pierre Neidhardt
Hi Mathieu,

Thanks for your comments!
Answers below.

>> 3. Size of the database:
>>I've persisted all locally-present store items for my current Guix version
>>and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.
>>
>>But since we can have multiple Guix versions, this means that the
>>packages have one entry per store path, so we might end up with more
>>entries than that as the number of Guix generations grows.
>
> I'm not sure we actually need to save the full history. I think we could
> just store that the package X produces [Y1, Y2, ...]  executable files. Then
> on package X update, the executable files list could be refreshed.

Maybe you are missing some context.  The original discussion is there:
https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00019.html.

Unlike Nix, we would like to do more than just index executable files.
Indeed, it's very useful to know where to find, say, a C header, a .so
library, a TeXlive .sty file, etc.

However, as you hinted, maybe it's unnecessary to save the file listings of
the packages for every Guix version.  Maybe we could only store the
"diffs" between the Guix generations.  I don't know if SQLite supports
this.  If not, it sounds like a rather complex thing to do.

But really, if the compressed database over multiple Guix generations is
<100 MiB, then size is not a big problem.

>>Question: Should we include empty directories in the database?  I'm tempted
>>to answer no.
>
> I would also say no, and also exclude non-executable files.

See above, I think we would lose a lot in not including non-executable files.

>>Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
>>problem?
>>It seems powerful enough in practice.  But maybe we can use regular
>>expressions in SQLite as well?
>
> From the UI perspective, we already have "guix search" that expects a
> regex. If we were to include a "guix file-search" command, then I think
> it would make sense that it uses the same regex syntax.

I found out that SQLite has a REGEXP operator (though the regexp() function
behind it must be supplied by the application); I'll see if it works well enough.
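(For what it's worth, SQLite only parses the REGEXP syntax; no regexp()
implementation ships by default, so the application has to register one.  A
quick sketch via Python's sqlite3 bindings, with made-up table and data:)

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (subpath TEXT NOT NULL)")
conn.executemany("INSERT INTO Files VALUES (?)",
                 [("/bin/guile",),
                  ("/lib/libguile-3.0.so",),
                  ("/share/doc/README",)])

# SQLite calls regexp(pattern, value) to evaluate "value REGEXP pattern";
# without this registration the query below would raise an error.
conn.create_function(
    "regexp", 2,
    lambda pattern, value: re.search(pattern, value) is not None)

rows = conn.execute("SELECT subpath FROM Files WHERE subpath REGEXP ?",
                    (r"lib.*\.so$",)).fetchall()
print(rows)  # [('/lib/libguile-3.0.so',)]
```

So regex search is possible, but every query then funnels each row through the callback, which is slower than a native LIKE scan.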

>> 7. Have substitute servers distribute database content.  When the user performs
>>    a file search, Guix asks the substitute server for a database update.  Only
>>    the diff should be sent over the network, not the whole thing since it might
>
> If I understand correctly, you are proposing to create local databases
> that would be kept in sync with a master database populated by the CI
> server. This seems a bit complex.
>
> What about extending Cuirass database to add the two tables you are
> proposing. Then, each time a package is built, if the version is
> updated, the "Files" table would be updated.
>
> Then we could add an HTTP interface such as "/search/file?query=libxxx"
> to Cuirass, that would directly query the database. In Guix itself, we
> could add the counterpart in the (guix ci) module.

The problem with this approach is that it would not work offline.  In my
opinion, this is a big limitation.  I'd rather have a local database.

Besides, we need a local database for non-official, locally-built
packages anyways (which Cuirass would not know about).  Since this is a
requirement, the only piece that'd be missing is database synchronization.

Thoughts?

-- 
Pierre Neidhardt
https://ambrevar.xyz/


signature.asc
Description: PGP signature


Re: File search progress: database review and question on triggers

2020-08-11 Thread Mathieu Othacehe


Hello Pierre,

Thanks for sharing your progress. A few remarks below.

> 3. Size of the database:
>I've persisted all locally-present store items for my current Guix version
>and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.
>
>But since we can have multiple Guix versions, this means that the
>packages have one entry per store path, so we might end up with more
>entries than that as the number of Guix generations grows.

I'm not sure we actually need to save the full history. I think we could
just store that the package X produces [Y1, Y2, ...]  executable files. Then
on package X update, the executable files list could be refreshed.

>Question: Should we include empty directories in the database?  I'm tempted
>to answer no.

I would also say no, and also exclude non-executable files.

>Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
>problem?
>It seems powerful enough in practice.  But maybe we can use regular
>expression in SQLite as well?

From the UI perspective, we already have "guix search" that expects a
regex. If we were to include a "guix file-search" command, then I think
it would make sense that it uses the same regex syntax.

> Next points I'd like to address:
>
> 6. Automatically persist the database entry when building a package.
>Any idea where I should plug that in?
>
> 7. Have substitute servers distribute database content.  When the user performs
>    a file search, Guix asks the substitute server for a database update.  Only
>    the diff should be sent over the network, not the whole thing since it might

If I understand correctly, you are proposing to create local databases
that would be kept in sync with a master database populated by the CI
server. This seems a bit complex.

What about extending Cuirass database to add the two tables you are
proposing. Then, each time a package is built, if the version is
updated, the "Files" table would be updated.

Then we could add an HTTP interface such as "/search/file?query=libxxx"
to Cuirass, that would directly query the database. In Guix itself, we
could add the counterpart in the (guix ci) module.

WDYT?

Thanks,

Mathieu



File search progress: database review and question on triggers

2020-08-10 Thread Pierre Neidhardt
Hi!

After much delay I finally got down to work on file search support for Guix.
By "file search", I mean the ability to find which package contains files
matching the queried pattern.

If we want to be able to know which package to install, we need file search to
be able to work for packages that have not yet been installed nor built
on the system.

As we previously discussed, a good approach, mostly for performance reasons,
would be to store all files in a database that's populated on demand from the
substitute servers.

What I've done so far:

1. An SQLite database with the following schema:

--8<---cut here---start->8---
create table if not exists Packages (
    name    text not null,
    output  text default 'out',
    system  text not null,
    path    text primary key not null, -- store path, e.g. /gnu/store/abcd...-foo
    version text not null,
    guix    text not null  -- the Guix version in which the package can be found
);

create table if not exists Files (
    subpath text not null,
    package text not null,
    primary key (subpath, package), -- same subpath can occur in multiple packages
    foreign key (package) references Packages(path) on delete cascade
);
--8<---cut here---end--->8---

   I'm not very good with SQL, so thanks in advance for reviewing this
   carefully; let me know if we can do better.
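   (As a sanity check, here is a throwaway exercise of that schema through
   Python's sqlite3 bindings; the store path and file name are invented:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table if not exists Packages (
    name    text not null,
    output  text default 'out',
    system  text not null,
    path    text primary key not null,
    version text not null,
    guix    text not null
);
create table if not exists Files (
    subpath text not null,
    package text not null,
    primary key (subpath, package),
    foreign key (package) references Packages(path) on delete cascade
);
""")
conn.execute("insert into Packages (name, system, path, version, guix) "
             "values (?, ?, ?, ?, ?)",
             ("samba", "x86_64-linux", "/gnu/store/xxxx-samba-4.12.3",
              "4.12.3", "1.1.0"))
conn.execute("insert into Files values (?, ?)",
             ("/lib/libiov-buf-samba4.so", "/gnu/store/xxxx-samba-4.12.3"))

# The query behind a file search: join Files back to Packages.
rows = conn.execute(
    "select p.name, p.output, p.version, f.subpath "
    "from Files f join Packages p on f.package = p.path "
    "where f.subpath like ?", ("%libio%",)).fetchall()
print(rows)  # [('samba', 'out', '4.12.3', '/lib/libiov-buf-samba4.so')]
```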

2. A procedure that persists the filepaths of a given package in the database.

3. Size of the database:
   I've persisted all locally-present store items for my current Guix version
   and it produced a database of 72 MiB.  It compresses down to 8 MiB in zstd.

   But since we can have multiple Guix versions, this means that the
   packages have one entry per store path, so we might end up with more
   entries than that as the number of Guix generations grows.

   The worst case is around (number of Guix generations) x ~100 MiB.

   If we compress, it would be at least 10x less, maybe way less.

   To be sustainable, I suggest that when we remove a Guix generation we
   "garbage-collect" the corresponding database entries.

   Thoughts?

4. Indexing speed:
   The above items took some 20 minutes to complete (on my rather powerful 
machine).
   A single store path takes a fraction of a second to index (on an SSD).
   The storage device is the bottleneck here.  Not sure we can do better than
   the following procedure:

--8<---cut here---start->8---
(define (directory-files path)
  "Return a list of all files within PATH, recursively.
Each file is returned as the path relative to PATH, starting with a '/'.

It's important that the first character be the directory separator because it
gives more expressive power for search.  For instance, searching \"/bin\"
matches both \"/bin/foo\" and \"/usr/bin/foo\" but not \"barbin\"."
  ;; TODO: This does not include empty directories.  Should we?
  ;; REVIEW: Use vlist for performance?  Big packages take a fraction of a
  ;; second on a hot cache, so it's probably not worth it.
  (let ((file-list '()))
(ftw path
 (lambda (filename statinfo flag)
   (when (eq? flag 'regular)
 (set! file-list (cons (string-drop filename (string-length path))
   file-list))) #t))
file-list))
--8<---cut here---end--->8---
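   (The same traversal rendered in Python for comparison, with os.walk
   playing the role of ftw; the demo tree is invented:)

```python
import os
import tempfile

def directory_files(path):
    """Return all regular files under PATH, recursively, as paths relative
    to PATH and starting with '/' (so "/bin" matches "/bin/foo" and
    "/usr/bin/foo" but not "barbin")."""
    result = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            # Drop the PATH prefix, keeping the leading '/'.
            result.append(os.path.join(root, name)[len(path):])
    return sorted(result)

# Throwaway demo tree.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "bin"))
open(os.path.join(top, "bin", "guile"), "w").close()
print(directory_files(top))  # ['/bin/guile']
```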

   Most of the indexing will be done by the substitute servers however, so this
   is of little concern for the end user.

   Question: Should we include empty directories in the database?  I'm tempted
   to answer no.

5. Search speed: It completes in a fraction of a second and supports
   SQLite patterns.  Example:

--8<---cut here---start->8---
> (format-search (search-file-package "%libio%"))
samba:out@4.12.3            /lib/libiov-buf-samba4.so
guix:out@1.1.0-18.218a67d   /share/guile/site/3.0/gnu/packages/patches/m4-gnulib-libio.patch
--8<---cut here---end--->8---

   Question: This bounds us to the SQLite syntax for pattern matching.  Is it a
   problem?
   It seems powerful enough in practice.  But maybe we can use regular
   expressions in SQLite as well?


Next points I'd like to address:

6. Automatically persist the database entry when building a package.
   Any idea where I should plug that in?

7. Have substitute servers distribute database content.  When the user performs
   a file search, Guix asks the substitute server for a database update.  Only
   the diff should be sent over the network, not the whole thing since it might
   be very large.

   Question 1: If the substitute server does not have data corresponding to the
   Guix server of the user, shall we send data of the version that's the closest
   to that of the user?
   Locally, if there are not many entries for the current Guix version, but many
   for an