[appengine-java] Re: Inverted Index

Dmitriy T. Thu, 26 Aug 2010 17:32:43 -0700

Hi.

I did something like what you describe. I believe it's called "boolean
queries" on an inverted index, but I'm not sure; my knowledge of the
terminology is pretty poor. In my case I have many small documents
(each document is just a small set of movie titles), not big text
documents. I think it works well for my case, but I'm still working on
it. You can see what I have here: http://movieshelf.appspot.com/ . Log
in with a Google account and click the "Add" link in the left panel.
Don't type in the text field; just copy and paste something into it
and press the search button (or maybe you don't need to press the
button, I'm not sure). You'll see the results and the time spent on
the search. If you type in the text box, the reported time can be
wrong, because I'm trying out some suggestion techniques and I'm not
sure they work correctly at the moment. All search results are cached,
so you need to enter different titles for the 2nd, 3rd, etc. searches.
My datastore now holds about 300K movies. I don't know how many titles
that is in total (a movie can have many titles), but about 800 MB of
datastore is in use at the moment, and about 50% of that (according to
Datastore Statistics) is the inverted index.


On Aug 26, 11:07 am, Lars Borup Jensen <[email protected]> wrote:
> Hi guys,
>
> Since there is no full-text search available in GAE/j, and I really
> need this for a new app I am writing, I have made a prototype
> implementation of an inverted index using the GAE datastore.
>
> The term is stored as a key, with the actual term as the key name
> (only the key is needed).
> Below each term I've added document references as child keys, like
> this: Term("term")/DocumentRef("10"), where 10 is the internal
> document number.
> An example:
>
> Term("stuff")
>   DocRef("1")
>   DocRef("2")
>
> Term("more")
>   DocRef("1")
>
> When searching for e.g. "more stuff" (which is a boolean AND) I do this:
>
> Query the DocRefs from the Term with the fewest doc-refs (children;
> this count is cached) and load the keys into a sorted set.
> Then query for doc-refs under the second term, filtering between the
> min. doc-id and the max. doc-id in the sorted set (meaning we only get
> possible matches among the docs we know contain the first term).
> Merge sets.
>
> What do you think? Is this a fair way to implement this (I'm working
> on scoring using tf-idf), and do you think it's possible to get it to
> perform well?
>
> /Lars Borup
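
For reference, the AND search described in the quoted message can be
sketched in memory roughly as below. This is only an illustration under
stated assumptions, not Lars's actual code: the class InvertedIndexSketch
and its methods add and searchAll are hypothetical names, a plain Map of
sorted sets stands in for the Term/DocRef keys, and doc ids are assumed
to be numeric. In the datastore, the subSet call would correspond to a
range-filtered query under the second term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the Term/DocRef key hierarchy.
class InvertedIndexSketch {
    // term -> sorted doc ids (the DocRef children of a Term)
    private final Map<String, NavigableSet<Long>> postings = new HashMap<>();

    void add(String term, long docId) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Boolean AND: returns the doc ids that contain every given term.
    NavigableSet<Long> searchAll(List<String> terms) {
        List<NavigableSet<Long>> lists = new ArrayList<>();
        for (String term : terms) {
            NavigableSet<Long> docs = postings.get(term);
            if (docs == null || docs.isEmpty()) {
                return new TreeSet<>(); // unknown term: nothing can match
            }
            lists.add(docs);
        }
        if (lists.isEmpty()) {
            return new TreeSet<>();
        }
        // Start from the term with the fewest doc refs.
        lists.sort((a, b) -> Integer.compare(a.size(), b.size()));
        NavigableSet<Long> candidates = new TreeSet<>(lists.get(0));
        for (int i = 1; i < lists.size() && !candidates.isEmpty(); i++) {
            // Only look at the next term's doc ids between the current
            // min and max candidate ids, then keep the common ones.
            NavigableSet<Long> window = lists.get(i)
                    .subSet(candidates.first(), true, candidates.last(), true);
            candidates.retainAll(window);
        }
        return candidates;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("stuff", 1L);
        index.add("stuff", 2L);
        index.add("more", 1L);
        System.out.println(index.searchAll(List.of("more", "stuff"))); // prints [1]
    }
}
```

Note that the min/max window only narrows the scan; the retainAll step
is still needed, because doc ids inside the window are not guaranteed
to contain the first term.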

