Re: Maximum score estimation

2022-12-19 Thread J. Delgado
Actually, I believe that the Lucene scoring function is based on *Okapi
BM25* (BM is an abbreviation of best matching), which is based on the
probabilistic retrieval framework developed in the 1970s and 1980s by
Stephen E. Robertson, Karen Spärck Jones, and others.

There are several interpretations for IDF and slight variations on its
formula. In the original BM25 derivation, the IDF component is derived from
the Binary Independence Model.

Info from:
https://en.m.wikipedia.org/wiki/Okapi_BM25
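
For reference, the formulation given on that page (Lucene's BM25Similarity is,
I believe, a close variant of it, with k1 = 1.2 and b = 0.75 by default) is:

  score(D, Q) = sum over query terms q_i of
                IDF(q_i) * f(q_i, D) * (k1 + 1)
                / ( f(q_i, D) + k1 * (1 - b + b * |D| / avgdl) )

  IDF(q_i) = ln( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )

where f(q_i, D) is the term frequency in the document, |D| the document length,
avgdl the average document length, N the number of documents in the index, and
n(q_i) the number of documents containing q_i.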

You could calculate an ideal score, but that can change every time a
> document is added to or deleted from the index, because of idf. So the
> ideal score isn’t a useful mental model.
>
> Essentially, you need to tell your users to worry about something that
> matters. The absolute value of the score does not matter.
>

While I understand the concern, BM25 scores are quite often used
post-retrieval (in 2-stage retrieval/ranking systems) to fuel learning-to-rank
models, which typically transform the score into [0,1] using a normalization
function that involves estimating a max score by looking at the
score distribution.
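
To make that concrete, here is a minimal sketch of the kind of post-retrieval
normalization meant here, using the simplest possible max-score estimate (the
observed max of the current result set). The class and method names are
illustrative, not an existing API; a real system would estimate the max from a
broader score distribution, e.g. a high percentile over many queries.

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public final class ScoreNormalizer {

  /** Rescales raw BM25 scores into [0,1], using the observed max of this result set. */
  public static float[] normalizeByMax(IndexSearcher searcher, Query query, int topN)
      throws java.io.IOException {
    TopDocs top = searcher.search(query, topN);
    float max = 0f;
    for (ScoreDoc sd : top.scoreDocs) {
      max = Math.max(max, sd.score);
    }
    float[] normalized = new float[top.scoreDocs.length];
    for (int i = 0; i < top.scoreDocs.length; i++) {
      // If nothing matched, everything stays at 0.
      normalized[i] = max > 0f ? top.scoreDocs[i].score / max : 0f;
    }
    return normalized;
  }
}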

J

On Mon, Dec 19, 2022 at 11:31 AM Walter Underwood 
wrote:

> That article is copied from the old wiki, so it is much earlier than 2019,
> more like 2009. Unfortunately, the links to the email discussion are all
> dead, but the issues in the article are still true.
>
> If you really want to go down that path, you might be able to do it with a
> similarity class that implements a probabilistic relevance model. I’d start
> the literature search with this Google query.
>
> probabilistic information retrieval
> 
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Dec 18, 2022, at 2:47 AM, Mikhail Khludnev  wrote:
>
> Thanks for the reply, Walter.
> Recently Robert commented on a PR with the link
> https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages
> and it gives arguments against my proposal. Honestly, I'm still in doubt.
>
> On Tue, Dec 6, 2022 at 8:15 PM Walter Underwood 
> wrote:
>
>> As you point out, this is a probabilistic relevance model. Lucene uses a
>> vector space model.
>>
>> A probabilistic model gives an estimate of how relevant each document is
>> to the query. Unfortunately, their overall relevance isn’t as good as a
>> vector space model.
>>
>> You could calculate an ideal score, but that can change every time a
>> document is added to or deleted from the index, because of idf. So the
>> ideal score isn’t a useful mental model.
>>
>> Essentially, you need to tell your users to worry about something that
>> matters. The absolute value of the score does not matter.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Dec 5, 2022, at 11:02 PM, Mikhail Khludnev  wrote:
>>
>> Hello dev!
>> Users are interested in the meaning of the absolute value of the score, but
>> we always reply that it's just a relative value. The maximum score of the
>> matched docs is not an answer.
>> Ultimately we need to measure how much sense a query has in the index.
>> E.g. a [jet OR propulsion OR spider] query should be measured as nonsense,
>> because the best-matching docs have much lower scores than a hypothetical
>> (and presumably absent) doc matching [jet AND propulsion AND spider].
>> Could there be a method that returns the maximum possible score if all query
>> terms matched? Something like stubbing postings on a virtual all_matching
>> doc with average stats like tf and field length, and kicking the scorers in?
>> It vaguely reminds me of probabilistic retrieval. Is there
>> anything like this already?
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>>
>>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
>
>
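
A rough way to picture the "maximum possible score if all query terms matched"
idea from this thread: for classic BM25, each term's contribution is bounded
above by IDF(t) * (k1 + 1), so a crude upper bound can be computed from index
statistics alone. The class below is purely illustrative, not an existing
Lucene API, and recent Lucene versions drop the constant (k1 + 1) factor from
BM25, so absolute numbers will differ even where the idea carries over.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public final class IdealScoreEstimate {

  /** Upper bound on the BM25 score of a doc matching all terms: sum of idf(t) * (k1 + 1). */
  public static float upperBound(IndexReader reader, float k1, Term... queryTerms)
      throws java.io.IOException {
    long numDocs = reader.numDocs();
    float bound = 0f;
    for (Term t : queryTerms) {
      long docFreq = reader.docFreq(t);
      // BM25 idf as used by Lucene: log(1 + (N - df + 0.5) / (df + 0.5))
      double idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
      // The tf-saturation part f*(k1+1)/(f + k1*norm) never exceeds (k1 + 1).
      bound += (float) (idf * (k1 + 1));
    }
    return bound;
  }
}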


Re: Vector based store and ANN

2019-03-02 Thread J. Delgado
>
>
>
> Yes, the idea is the same, with an extra step that Rene also seems to
> allude to here
> <https://www.slideshare.net/RenKriegler/a-picture-is-worth-a-thousand-words-93680178>
>  in his comment. Instead of using these types of techniques only at the
> scoring time, we can use them for information retrieval from the index.
> This will allow us to, for example, index millions of images and quickly
> and efficiently lookup the most relevant images.
>
>
>
> I would love to hear your and others' thoughts on this. I think there is a
> great opportunity here, but it would need a lot of input and guidance from
> the experts here.
>
>
>
> Thank you,
>
>
>
> Pedram
>
>
>
> *From:* David Smiley 
> *Sent:* Friday, March 1, 2019 12:11 PM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) ; Arun
> Sacheti ; Kun Wu ; Junhua Wang <
> junhua.w...@microsoft.com>; Jason Li ; René Kriegler
> 
> *Subject:* Re: Vector based store and ANN
>
>
>
> This presentation by Rene Kriegler at Haystack 2018 was a real eye-opener
> to me on this subject: https://haystackconf.com/2018/relevance-scoring/.
> Uses random-projection forests, which is a very clever technique. (CC'ing Rene)
>
>
>
> ~ David
>
> On Fri, Mar 1, 2019 at 1:30 PM Pedram Rezaei <
> pedr...@microsoft.com.invalid> wrote:
>
> Hi there,
>
>
>
> Thank you for the responses. Yes, we have a few scenarios in mind that can
> benefit from a vector-based index optimized for ANN searches:
>
>
>
>- Advanced, optimized, and high precision visual search: For this to
>work, we would convert the images to their vector representations and then
>use algorithms and implementations such as SPTAG
>(https://github.com/Microsoft/SPTAG), FAISS
>(https://github.com/facebookresearch/faiss), and HNSWLIB
>(https://github.com/nmslib/hnswlib).
>- Advanced document retrieval: Using a numerical vector representation
>of a document, we could improve the search result
>- Nearest neighbor queries: discovering the nearest neighbors to a
>given query could also benefit from these ANN algorithms (although doesn’t
>necessarily need the vector based index)
>
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
>
>
> Thanks,
>
>
>
> Pedram
>
>
>
> *From:* J. Delgado 
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) 
> *Subject:* Re: Vector based store and ANN
>
>
>
> Lucene’s scoring function (which I believe is Okapi BM25:
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in ANN to be applied to a different
> kind of vector representation, say for example Doc2Vec?
>
>
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand  wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing 

Re: Vector based store and ANN

2019-03-01 Thread J. Delgado
>of a document, we could improve the search result
>- Nearest neighbor queries: discovering the nearest neighbors to a
>given query could also benefit from these ANN algorithms (although doesn’t
>necessarily need the vector based index)
>
>
>
> I would be grateful to hear your thoughts and whether the community is
> open to a conversation on this topic with my team.
>
>
>
> Thanks,
>
>
>
> Pedram
>
>
>
> *From:* J. Delgado 
> *Sent:* Thursday, February 28, 2019 7:38 AM
> *To:* dev@lucene.apache.org
> *Cc:* Radhakrishnan Srikanth (SRIKANTH) 
> *Subject:* Re: Vector based store and ANN
>
>
>
> Lucene’s scoring function (which I believe is Okapi BM25:
> https://en.m.wikipedia.org/wiki/Okapi_BM25)
> is a kind of nearest neighbor using the TF-IDF vector representation of
> documents and query. Are you interested in ANN to be applied to a different
> kind of vector representation, say for example Doc2Vec?
>
>
>
> On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand  wrote:
>
> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
>  wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interest in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
> --
>
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
>
> LinkedIn: http://linkedin.com/in/davidwsmiley
> | Book: http://www.solrenterprisesearchserver.com
>


Re: Vector based store and ANN

2019-02-28 Thread J. Delgado
Lucene’s scoring function (which I believe is okapi BM25
https://en.m.wikipedia.org/wiki/Okapi_BM25) is a kind of nearest neighbor
using the TF-IDF vector representation of documents and query. Are you
interested in ANN to be applied to a different kind of vector
representation, say for example Doc2Vec?
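
For concreteness, the exact (non-approximate) version of that nearest-neighbour
view is just a cosine similarity over dense vectors, evaluated against every
candidate; that brute-force cost is what ANN structures avoid. A plain-Java
illustration (not a Lucene API):

public final class VectorMath {

  /** Cosine similarity between two dense vectors, e.g. Doc2Vec embeddings. */
  public static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    // A brute-force kNN evaluates this against every indexed vector; that
    // O(numDocs * dim) cost per query is what ANN structures (HNSW, SPTAG, ...) avoid.
    return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
  }
}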

On Thu, Feb 28, 2019 at 5:59 AM Adrien Grand  wrote:

> Hi Pedram,
>
> We don't have much in this area, but I'm hearing increasing interest
> so it'd be nice to get better there! The closest that we have is this
> class that can search for nearest neighbors for a vector of up to 8
> dimensions:
> https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/document/FloatPointNearestNeighbor.java
> .
>
> On Wed, Feb 27, 2019 at 1:44 AM Pedram Rezaei
>  wrote:
> >
> > Hi there,
> >
> >
> >
> > Is there a way to store numerical vectors (vector based index) and
> perform search based on Approximate Nearest Neighbor class of algorithms in
> Lucene?
> >
> >
> >
> > If not, has there been any interest in the topic so far?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Pedram
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Parallel Scoring

2019-02-01 Thread J. Delgado
Hi folks,

Assuming documents can be scored independently, what is the level of
document scoring parallelism (thread- or process-wise) that people have
experimented with on a single multi-core machine containing a single shard?
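
One data point worth noting: Lucene's IndexSearcher can already score leaf
slices (groups of segments) concurrently when constructed with an executor, so
intra-query parallelism on a single shard roughly tracks the number of slices.
A minimal sketch, with the pool sizing and the match-all query purely
illustrative:

import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public final class ParallelSearchDemo {
  public static void main(String[] args) throws Exception {
    int threads = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    Directory dir = FSDirectory.open(Paths.get(args[0]));
    try (IndexReader reader = DirectoryReader.open(dir)) {
      // Each leaf slice (group of segments) is scored concurrently on the executor.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
      System.out.println("total hits: " + hits.totalHits);
    } finally {
      pool.shutdown();
    }
  }
}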


Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread J. Delgado
What about the use of word embeddings (see
https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
to compute word similarity?
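
As a rough illustration of how that could plug into query building without
touching SynonymQuery: OR in the embedding neighbours of a query term with a
boost proportional to their cosine similarity. The field name, the threshold
and the in-memory embedding map below are assumptions made for the sake of the
sketch, not an existing API:

import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class EmbeddingExpansion {

  /** ORs embedding neighbours of a query term into the query, boosted by similarity. */
  public static Query expand(String field, String queryTerm,
                             Map<String, float[]> embeddings, double threshold) {
    BooleanQuery.Builder bq = new BooleanQuery.Builder();
    bq.add(new TermQuery(new Term(field, queryTerm)), BooleanClause.Occur.SHOULD);
    float[] qVec = embeddings.get(queryTerm);
    if (qVec == null) {
      return bq.build();                       // unknown word: no expansion
    }
    for (Map.Entry<String, float[]> e : embeddings.entrySet()) {
      if (e.getKey().equals(queryTerm)) {
        continue;
      }
      double sim = cosine(qVec, e.getValue());
      if (sim >= threshold) {
        // A related term contributes, but down-weighted by how similar it actually is.
        bq.add(new BoostQuery(new TermQuery(new Term(field, e.getKey())), (float) sim),
               BooleanClause.Occur.SHOULD);
      }
    }
    return bq.build();
  }

  private static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
  }
}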

On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Hey folks,
>
> I wanted to open up a discussion about a change to the usage of
> SynonymQuery. The goal here is to have a broader library of queries that
> can address other cases where related terms occupy the same position but
> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
> ambiguous terms, and other query expansion situations).
>
>
> I bring this up because we've noticed (as I'm sure many of you have) the
> pattern of clients jamming any related term into a synonyms file and being
> surprised with odd results. I like the idea of enforcing that "synonyms" means
> exactly-the-same in Lucene-land. It's an easy thing to tell a client and to
> set up simple patterns. So for synonyms, I think leaving SynonymQuery in
> place works great.
>
> But I feel if that's the rule, we need to open up discussion of other
> methods of scoring conceptual 'related term' relationships that usually
> comes up in the context of query expansion. This paper (
> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
> the current thinking for scoring various query expansion scenarios like
> those we deal with in the messy, ambiguous uses of synonyms in prod systems
> (khakis aren't trousers, they're a kind-of trouser).
>
>
> The cool thing is many of the ideas in this paper seem doable with
> existing Lucene index stats. So one might imagine a 'related terms' token
> filter that injected some scoring based on how related it really is to
> the original query term using Jaccard, Dice, or other methods called out in
> this paper.
>
>
> Another insightful set of research is this article on concept scoring (
> https://usabilityetc.com/articles/information-retrieval-concept-matching/),
> which prioritizes related terms by connectedness and other factors.
>
> Needless to say, it's an open area how two terms someone has asserted are
> related to a query term 'should be' scored. It's one of those things that
> likely will forever depend on a number of domain and application specific
> factors. It's possibly a big opportunity of improvement for Lucene - but
> likely is about putting the right framework in place to allow for good
> default set of query-expansion scoring scenarios with options for
> customization.
>
> What I'm proposing is:
>
>
>- Submit a small patch that restricts SynonymQuery to tokens of type
>  "SYNONYM" in the same posn, which allows some short-term work to be done
>  with the current Lucene QueryBuilder. Any additional non-synonym terms
>  would be appended as a boolean query for now.
>- Begin work on alternate 'related-term' scoring systems that also key
>  off the token type in QueryBuilder to create custom scoring using built-in
>  term stats. The possibilities here are endless, up to weighted related
>  terms (i.e. Alessandro's patch), feeding back Rocchio relevance feedback, etc.
>
>
> I'm curious what folks would think of a patch for bullet one followed by
> other patches down the road for additional functionality?
>
> (related to discussion in this Elasticsearch PR
>
> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
> )
>
> --
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug
>


Word Embedding stored in Lucene Index

2017-12-09 Thread J. Delgado
It has been a couple of years since the Neu-IR WS (
https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/craswell-report-2016.pdf).
I'm wondering if anyone has tinkered with storing word/document embeddings
and using them inside Lucene to improve the core relevance model.


One of the key ideas of neural search is to leverage such representations
in order to improve the effectiveness of search engines. It would be very
nice if we could have a retrieval model that relies on word and document
vectors (also called *embeddings*) with the above capabilities, so we could
calculate and leverage document and word similarities very efficiently by
looking at the "nearest neighbours".
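
As a strawman for the "store embeddings in the index and use them in the
relevance model" part: the simplest thing that works is to keep each document's
vector as a binary stored field and cosine-re-rank the top N of an ordinary
keyword query. The field name, encoding and re-rank depth are illustrative
assumptions, and a stored-field lookup per hit is obviously not the efficient
nearest-neighbour structure asked for above.

import java.nio.ByteBuffer;
import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

public final class EmbeddingRerank {

  /** Serializes a document embedding so it can live alongside the text fields. */
  public static void addVector(Document doc, float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) {
      buf.putFloat(v);
    }
    doc.add(new StoredField("doc_vector", buf.array()));
  }

  /** Re-ranks the keyword top-N by cosine similarity to a query embedding. */
  public static ScoreDoc[] rerank(IndexSearcher searcher, Query keywordQuery,
                                  float[] queryVector, int topN) throws java.io.IOException {
    TopDocs top = searcher.search(keywordQuery, topN);
    for (ScoreDoc sd : top.scoreDocs) {
      BytesRef bytes = searcher.doc(sd.doc).getBinaryValue("doc_vector");
      sd.score = (float) cosine(queryVector, decode(bytes));
    }
    Arrays.sort(top.scoreDocs, (a, b) -> Float.compare(b.score, a.score));
    return top.scoreDocs;
  }

  private static float[] decode(BytesRef bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length);
    float[] out = new float[bytes.length / Float.BYTES];
    for (int i = 0; i < out.length; i++) {
      out[i] = buf.getFloat();
    }
    return out;
  }

  private static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
  }
}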


I found this code that can generate word2vec from a Lucene index:

https://github.com/kojisekig/word2vec-lucene


But the closest work along the lines of using DL in Lucene is this paper
about "Large Scale Indexing and Searching Deep Convolutional Neural Network
Features" (https://link.springer.com/chapter/10.1007/978-3-319-43946-4_14)
that applies mainly to content-based image retrieval.


-- J


CFP RecSysTV 2015

2015-05-08 Thread J. Delgado
Apologies for any cross-posting. Please distribute to colleagues who may be
interested.
-- Joaquin (on behalf of the Organizers)

CALL FOR PAPERS

2nd Workshop on Recommender Systems for Television and Online Video
http://www.recsys.tv

We are pleased to invite you to participate in the 2nd Workshop on
Recommender Systems for Television and Online Video (RecSysTV 2015) that is
happening in conjunction with the ACM RecSys 2015 conference in Vienna,
Austria from September 16th-20th 2015.

For many households the television is still the central entertainment hub
in their home, and the average TV viewer spends about half of their leisure
time in front of a TV (3-5 hours/day). The choice of what to watch becomes
more overwhelming though because the entertainment options are scattered
across various channels, such as on-demand video, digital recorders (on
premise or in the cloud) and the traditional linear TV. In addition,
consumers can also access the content not just on the big screen, but also
on their computers, phones, and tablet devices.

Recommendation systems provide TV users with suggestions about both online
video-on-demand and broadcast content and help them to search and browse
intelligently for content that is relevant to them. While many open
questions in video-on-demand recommendations have already been solved,
recommendation systems for broadcast content (e.g., linear channels and
catch-up TV) still experience a number of unique challenges due to the
peculiarity of such domain. For example, the content available on linear
channels is constantly changing and often only available once which leads
to severe cold start problems and we often consume TV in groups of varying
compositions (household vs individual) which makes building taste profiles
and modeling consumer behavior very challenging.

We encourage participation along several themes which include but are not
limited to:

** Context-aware TV and online video recommendations
   * Leveraging contextual viewing behaviour, e.g. device specific
recommendations
   * Mood based recommendations
   * Group recommendations
** User modeling  leveraging user viewing and interaction behavior
   * How can social media improve TV recommendations
   * Cross-domain recommendation algorithms (linear TV, video on demand,
DVR, gaming consoles)
   * Multi-viewer profile separation
   * Evaluation metrics for TV and online video recommendations
** Content-based TV and online video recommendations
   * Analysis techniques for video recommendations based on video, audio,
or closed caption signals
   * Utilization of external data sources (movie reviews, ratings, plot
summaries) for recommendations
** Other topics related to TV and online video recommendations
   * Video playlisting
   * Linear TV usage and box office success prediction
   * Catch-up TV recommendations
   * Personalized advertisement recommendations
   * Recommendations of 2nd screen web content
   * Recommendations of short form videos (previews, trailers, music videos)

IMPORTANT DATES

- Submission deadline: June 29, 2015
- Notification: July 20, 2015
- Camera-ready: July 27, 2015
- Deadline for author registration: August 16, 2015
- Workshop date: September 19, 2015 (full day)

SUBMISSION INFORMATION

We are soliciting submissions of long and short papers, as well as position
presentations.

Long papers are to represent original, mature research and can be 6-8 pages
long. We request potential submitters to adhere to double-column ACM SIG
format in line with standard RecSys formatting guidelines.

Short papers are to represent early/promising research, demos or industrial
case studies and can be 4 pages in length (ACM RecSys style) or up to 20
slides.

Use the following website to electronically submit your paper:
https://cmt.research.microsoft.com/RECSYSTV2015/

Note that attendance at the workshop requires registration for the ACM
RecSys 2015 conference as a whole. This year there is no separate
registration for workshops. Each accepted workshop paper must register at
least one author at the conference.

ORGANIZING COMMITTEE

Jan Neumann, Comcast Labs, Washington, DC (jan_neum...@cable.comcast.com)
John Hannon, Zalando
Roberto Turrin, ContentWise, Milan, Italy (roberto.tur...@contentwise.tv)
Danny Bickson, Dato, Seattle, WA (bick...@dato.com)
Hassan Sayyadi, Comcast Labs, Washington, DC (
hassan.sayy...@cable.comcast.com)

PROGRAM COMMITTEE

Hidasi Balazs, GravityRD
Justin Basilico, Netflix
Craig Carmichael, Rovi
Emanuele Coviello, Keevio
Paolo Cremonesi, Politecnico di Milano
Joaquin Delgado, OnCue TV (Verizon)
Christophe Diot, Technicolor
Diana Hu, OnCue TV (Verizon)
Brendan Kitts, Adapt.TV (AOL)
Gert Lanckriet, UC San Diego
Royi Ronen, Microsoft
Barry Smyth, Insight Centre for Data Analytics
Udi Weinsberg, Technicolor Labs
Esti Widder, Viaccess-Orca
Jiayu Zhou, Samsung Research
David Zibriczky, ImpressTv

For up to date information about the workshop, please see the workshop
website at

Game time

2015-05-05 Thread J. Delgado
on the 16 Alex has baseball game at 12:30


Re: Game time

2015-05-05 Thread J. Delgado
Sorry mistake ...

On Tuesday, May 5, 2015, J. Delgado joaquin.delg...@gmail.com wrote:

 on the 16 Alex has baseball game at 12:30


Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
Sorry, as I was saying, the machine learning approach is NOT limited to
having lots of user action data. In fact, having little or no user action
data is commonly referred to as the cold start problem in recommender
systems. In that case, it is useful to exploit content-based similarities
as well as context (such as location, time-of-day, day-of-week,
site section, device type, etc.) to make predictions/scores. This can still
be combined with the usual IR-based scoring to keep semantics as the
driving force.

-J

On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote:

 BTW, as i mentioned, the machine learning

 On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote:

 I totally agree that it depends on the task at hand and the
 amount/quality of the data that you can get hold of.

 The problem of relevancy in the traditional document/semantic information
 retrieval (IR) task is such a hard thing because there is little or no
 source of truth you could use as training data (unless you use something
 like TREC for a limited set of documents to evaluate) in most cases.
 Additionally, the feedback data you get from users, if it exists, is very
 noisy. In this case prior knowledge, encoded as attribute weights, crafted
 functions, and heuristics, is your best bet. You can, however, mine the
 content itself by leveraging clustering/topic modeling via LDA, which is an
 unsupervised learning algorithm, and use that as input. Or perhaps
 Labeled-LDA and Multi-Grain LDA, another topic model for classification and
 sentiment analysis, which are supervised algorithms, in which case you can
 still use the approach I suggested.

 However, for search tasks that involve e-commerce, advertisements,
 recommendations, etc., there seems to be more data that can be captured
 from users' interactions with the system/site that can be used as signals,
 and users' actions (adding things to wish lists, clicks for more info,
 conversions, etc.) are much more telling about the intention/value the user
 gives to what is presented to them. Then viewing search as a machine
 learning/multi-objective optimization problem makes sense.

 My point is that search engines nowadays are used for all these use cases,
 thus it is worth exploring all the avenues exposed in this thread.

 Cheers,

 -- Joaquin

 On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West tburt...@umich.edu
 wrote:

 Hi Doug and Joaquin,

 This is a really interesting discussion.  Joaquin, I'm looking forward
 to taking your code for a test drive.  Thank you for making it publicly
 available.

 Doug,  I'm interested in your pyramid observation.  I work with academic
 search which has some of the problems unique queries/information needs and
 of data sparsity you mention in your blog post.

 This article makes a similar argument that massive amounts of user data
 are so important for modern search engines that it is essentially a barrier
 to entry for new web search engines.
 Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
 Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
 http://www.springerlink.com/index/58255K40151U036N.pdf

  Tom


 I noticed that information retrieval problems fall into a sort-of
 layered pyramid. At the topmost point is someone like Google, where the
 sheer amount of high-quality user behavior data means that search truly is a
 machine learning problem, much as you propose. As you move down the pyramid
 the quality of user data diminishes.

 Eventually you get to a very thick layer of middle-class search
 applications that value relevance, but have very modest amounts or no user
 data. For most of them, even if they tracked their searches over a year,
 they *might* get good data over their top 50 searches. (I know cause they
 send me the spreadsheet and say fix it!). The best they can use analytics
 data is after-action troubleshooting. Actual user emails complaining about
 the search can be more useful than behavior data!






Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
I totally agree that it depends on the task at hand and the amount/quality
of the data that you can get hold of.

The problem of relevancy in the traditional document/semantic information
retrieval (IR) task is such a hard thing because there is little or no
source of truth you could use as training data (unless you use something
like TREC for a limited set of documents to evaluate) in most cases.
Additionally, the feedback data you get from users, if it exists, is very
noisy. In this case prior knowledge, encoded as attribute weights, crafted
functions, and heuristics, is your best bet. You can, however, mine the
content itself by leveraging clustering/topic modeling via LDA, which is an
unsupervised learning algorithm, and use that as input. Or perhaps
Labeled-LDA and Multi-Grain LDA, another topic model for classification and
sentiment analysis, which are supervised algorithms, in which case you can
still use the approach I suggested.

However, for search tasks that involve e-commerce, advertisements,
recommendations, etc., there seems to be more data that can be captured
from users' interactions with the system/site that can be used as signals,
and users' actions (adding things to wish lists, clicks for more info,
conversions, etc.) are much more telling about the intention/value the user
gives to what is presented to them. Then viewing search as a machine
learning/multi-objective optimization problem makes sense.

My point is that search engines nowadays are used for all these use cases,
thus it is worth exploring all the avenues exposed in this thread.

Cheers,

-- Joaquin

On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West tburt...@umich.edu wrote:

 Hi Doug and Joaquin,

 This is a really interesting discussion.  Joaquin, I'm looking forward to
 taking your code for a test drive.  Thank you for making it publicly
 available.

 Doug,  I'm interested in your pyramid observation.  I work with academic
 search which has some of the problems unique queries/information needs and
 of data sparsity you mention in your blog post.

 This article makes a similar argument that massive amounts of user data
 are so important for modern search engines that it is essentially a barrier
 to entry for new web search engines.
 Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
 Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
 http://www.springerlink.com/index/58255K40151U036N.pdf

  Tom


 I noticed that information retrieval problems fall into a sort-of layered
 pyramid. At the topmost point is someone like Google, where the sheer
 amount of high-quality user behavior data means that search truly is a machine
 learning problem, much as you propose. As you move down the pyramid the
 quality of user data diminishes.

 Eventually you get to a very thick layer of middle-class search
 applications that value relevance, but have very modest amounts or no user
 data. For most of them, even if they tracked their searches over a year,
 they *might* get good data over their top 50 searches. (I know cause they
 send me the spreadsheet and say fix it!). The best they can use analytics
 data is after-action troubleshooting. Actual user emails complaining about
 the search can be more useful than behavior data!





Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
BTW, as i mentioned, the machine learning

On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote:

 I totally agree that it depends on the task at hand and the amount/quality
 of the data that you can get hold of.

 The problem of relevancy in the traditional document/semantic information
 retrieval (IR) task is such a hard thing because there is little or no
 source of truth you could use as training data (unless you use something
 like TREC for a limited set of documents to evaluate) in most cases.
 Additionally, the feedback data you get from users, if it exists, is very
 noisy. In this case prior knowledge, encoded as attribute weights, crafted
 functions, and heuristics, is your best bet. You can, however, mine the
 content itself by leveraging clustering/topic modeling via LDA, which is an
 unsupervised learning algorithm, and use that as input. Or perhaps
 Labeled-LDA and Multi-Grain LDA, another topic model for classification and
 sentiment analysis, which are supervised algorithms, in which case you can
 still use the approach I suggested.

 However, for search tasks that involve e-commerce, advertisements,
 recommendations, etc., there seems to be more data that can be captured
 from users' interactions with the system/site that can be used as signals,
 and users' actions (adding things to wish lists, clicks for more info,
 conversions, etc.) are much more telling about the intention/value the user
 gives to what is presented to them. Then viewing search as a machine
 learning/multi-objective optimization problem makes sense.

 My point is that search engines nowadays are used for all these use cases,
 thus it is worth exploring all the avenues exposed in this thread.

 Cheers,

 -- Joaquin

  On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West tburt...@umich.edu wrote:

 Hi Doug and Joaquin,

 This is a really interesting discussion.  Joaquin, I'm looking forward to
 taking your code for a test drive.  Thank you for making it publicly
 available.

 Doug,  I'm interested in your pyramid observation.  I work with academic
 search which has some of the problems unique queries/information needs and
 of data sparsity you mention in your blog post.

 This article makes a similar argument that massive amounts of user data
 are so important for modern search engines that it is essentially a barrier
 to entry for new web search engines.
 Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
 Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
 http://www.springerlink.com/index/58255K40151U036N.pdf

  Tom


 I noticed that information retrieval problems fall into a sort-of
 layered pyramid. At the topmost point is someone like Google, where the
 sheer amount of high-quality user behavior data means that search truly is a
 machine learning problem, much as you propose. As you move down the pyramid
 the quality of user data diminishes.

 Eventually you get to a very thick layer of middle-class search
 applications that value relevance, but have very modest amounts or no user
 data. For most of them, even if they tracked their searches over a year,
 they *might* get good data over their top 50 searches. (I know cause they
 send me the spreadsheet and say fix it!). The best they can use analytics
 data is after-action troubleshooting. Actual user emails complaining about
 the search can be more useful than behavior data!






Re: Where Search Meets Machine Learning

2015-05-02 Thread J. Delgado
 supervised training data over good-enough features. This can be hard
 to do for a broad swath of middle-tier search applications, but
 increasingly useful as scale goes up. I'd be interested to hear your
 thoughts on this article
 http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/
 I wrote about collecting click tracking and other relevance feedback data:

 Good stuff! Again, thanks for sharing,
 -Doug



 On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado joaquin.delg...@gmail.com
 wrote:

 Here is a presentation on the topic:

 http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

 Search can be viewed as a combination of a) A problem of constraint
 satisfaction, which is the process of finding a solution to a set of
 constraints (query) that impose conditions that the variables (fields) must
 satisfy with a resulting object (document) being a solution in the feasible
 region (result set), plus b) A scoring/ranking problem of assigning values
 to different alternatives, according to some convenient scale. This
 ultimately provides a mechanism to sort various alternatives in the result
 set in order of importance, value or preference. In particular scoring in
 search has evolved from being a document centric calculation (e.g. TF-IDF)
 proper from its information retrieval roots, to a function that is more
 context sensitive (e.g. include geo-distance ranking) or user centric (e.g.
 takes user parameters for personalization) as well as other factors that
 depend on the domain and task at hand. However, most systems that
 incorporate machine learning techniques to perform classification or
 generate scores for these specialized tasks do so as a post-retrieval
 re-ranking function, outside of search! In this talk I show ways of
 incorporating advanced scoring functions, based on supervised learning and
 bid-scaling models, into popular search engines such as Elasticsearch and
 potentially Solr. I'll provide practical examples of how to construct such
 ML Scoring plugins in search to generalize the application of a search
 engine as a model evaluator for supervised learning tasks. This will
 facilitate the building of systems that can do computational advertising,
 recommendations and specialized search systems, applicable to many domains.

 Code to support it (only elastic search for now):
 https://github.com/sdhu/elasticsearch-prediction

 -- J







 --
 *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
 LLC | 240.476.9983 | http://www.opensourceconnections.com
 Author: Taming Search http://manning.com/turnbull from Manning
 Publications
 This e-mail and all contents, including attachments, is considered to be
 Company Confidential unless explicitly stated otherwise, regardless
 of whether attachments are marked as such.



Where Search Meets Machine Learning

2015-04-29 Thread J. Delgado
Here is a presentation on the topic:
http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

Search can be viewed as a combination of a) A problem of constraint
satisfaction, which is the process of finding a solution to a set of
constraints (query) that impose conditions that the variables (fields) must
satisfy with a resulting object (document) being a solution in the feasible
region (result set), plus b) A scoring/ranking problem of assigning values
to different alternatives, according to some convenient scale. This
ultimately provides a mechanism to sort various alternatives in the result
set in order of importance, value or preference. In particular scoring in
search has evolved from being a document centric calculation (e.g. TF-IDF)
proper from its information retrieval roots, to a function that is more
context sensitive (e.g. include geo-distance ranking) or user centric (e.g.
takes user parameters for personalization) as well as other factors that
depend on the domain and task at hand. However, most systems that
incorporate machine learning techniques to perform classification or
generate scores for these specialized tasks do so as a post-retrieval
re-ranking function, outside of search! In this talk I show ways of
incorporating advanced scoring functions, based on supervised learning and
bid-scaling models, into popular search engines such as Elasticsearch and
potentially Solr. I'll provide practical examples of how to construct such
ML Scoring plugins in search to generalize the application of a search
engine as a model evaluator for supervised learning tasks. This will
facilitate the building of systems that can do computational advertising,
recommendations and specialized search systems, applicable to many domains.

Code to support it (only elastic search for now):
https://github.com/sdhu/elasticsearch-prediction
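
For readers who only want the shape of the idea without digging into that
plugin code: in its most degenerate form, the search engine retrieves
candidates under the query constraints, and a trained model then combines the
IR score with other signals to produce the final ordering. A toy linear-model
rescorer, purely illustrative (features and weights below are made up):

import java.util.Arrays;
import org.apache.lucene.search.ScoreDoc;

public final class MlRescorer {

  /** Toy linear model: weights learned offline, e.g. from clicks or conversions. */
  static float modelScore(float irScore, float predictedCtr, float bid, float[] w) {
    return w[0] * irScore + w[1] * predictedCtr + w[2] * bid;
  }

  /** Re-orders the retrieved candidates by the model score instead of the raw IR score. */
  static void rescore(ScoreDoc[] candidates, float[][] features, float[] weights) {
    for (int i = 0; i < candidates.length; i++) {
      // features[i][0] = predicted CTR, features[i][1] = bid; both computed outside search.
      candidates[i].score =
          modelScore(candidates[i].score, features[i][0], features[i][1], weights);
    }
    Arrays.sort(candidates, (a, b) -> Float.compare(b.score, a.score));
  }
}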

-- J


Re: Welcome back, Wolfgang Hoschek!

2013-09-26 Thread J. Delgado
Percolator for Solr? :{

On Thursday, September 26, 2013, Otis Gospodnetic wrote:

 Another welcome back!  Any specific area where you plan on contributing?

 Otis
 --
  Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



 On Fri, Sep 27, 2013 at 12:58 AM, Wolfgang Hoschek
  whosc...@cloudera.com wrote:
  Thanks to all! Looking forward to more contributions.
 
  Wolfgang.
 
  On Sep 26, 2013, at 3:21 AM, Uwe Schindler wrote:
 
  Hi,
 
  I'm pleased to announce that after a long abstinence, Wolfgang Hoschek
 rejoined the Lucene/Solr committer team. He is working now at Cloudera and
 plans to help with the integration of Solr and Hadoop.
  Wolfgang originally wrote the MemoryIndex, which is used by the
 classical Lucene highlighter and ElasticSearch's percolator module.
 
  Looking forward to new contributions.
 
   Welcome back & heavy committing! :-)
  Uwe
 
  P.S.: Wolfgang, as soon as you have setup your subversion access, you
 should add yourself back to the committers list on the website as well.
 
  -
  Uwe Schindler
   uschind...@apache.org
  Apache Lucene PMC Chair / Committer
  Bremen, Germany
  http://lucene.apache.org/
 
 
 
  -
   To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
   For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
   To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
   For additional commands, e-mail: dev-h...@lucene.apache.org
 

 -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Indexing Boolean Expressions

2013-02-11 Thread J. Delgado
I guess ElasticSearch went ahead of Solr with the percolate API, which is
exactly what is needed for the two-way constraint+doc matching problem present
in advertising systems and other use cases:

http://www.elasticsearch.org/guide/reference/api/percolate.html
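
For anyone curious how percolation can be approximated directly on the Lucene
side: MemoryIndex (mentioned elsewhere in this archive) lets you run every
registered query against a single in-memory document. A minimal sketch, with
the field name and the query registry as illustrative assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public final class TinyPercolator {

  /** Returns the ids of registered queries that match the given single document. */
  public static List<String> percolate(Map<String, Query> registeredQueries, String text) {
    Analyzer analyzer = new StandardAnalyzer();
    MemoryIndex doc = new MemoryIndex();
    doc.addField("body", text, analyzer);        // the one document, held entirely in memory
    List<String> matches = new ArrayList<>();
    for (Map.Entry<String, Query> e : registeredQueries.entrySet()) {
      if (doc.search(e.getValue()) > 0.0f) {      // score > 0 means the stored query matched
        matches.add(e.getKey());
      }
    }
    return matches;
  }
}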

Cheers,

Joaquin Delgado, PhD.
http://www.linkedin.com/pub/profile/0/04b/277

On Mon, Mar 26, 2012 at 10:17 AM, Walter Underwood wun...@wunderwood.org wrote:

 Efficient rule matching goes further back, at least to alerting in
 Verity K2.

 wunder
 Search Guy, Chegg

 On Mar 26, 2012, at 10:15 AM, J. Delgado wrote:

 BTW, the idea of indexing Boolean Expressions inside a text indexing
 engine is not new. For example Oracle Text provides the CTXRULE index and
 the MATCHES operator within their indexing stack, which is primarily used
 for Rule-based text classification.

 See:

 http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm#autoId8

 http://docs.oracle.com/cd/B28359_01/text.111/b28303/classify.htm#g1011013

 -- J

  On Mon, Mar 26, 2012 at 10:07 AM, J. Delgado joaquin.delg...@gmail.com wrote:

  In full disclosure, there is a patent application that Yahoo! has filed
  for the use of inverted indexes with complex predicates for matching
  contracts and opportunities in advertising:

  http://www.google.com/patents/US20110016109?printsec=abstract#v=onepageqf=false

  However, I believe there are many more applications that can benefit from
  similar matching techniques (e.g. recommender systems,
  e-commerce, recruiting, etc.) to make it worthwhile implementing the ideas
  exposed in the original VLDB'09 paper (which is public) in Lucene.

 As a Yahoo! employee, I might not be able to directly contribute to this
 project but will be happy to point to any publicly available pointer that
 can help.

 Cheers,

 -- Joaquin


 On Sun, Mar 25, 2012 at 11:44 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Hello Joaquin,

 I looked through the paper several times, and see no problem to
 implement it in Lucene (the trivial case at least):

 Let's index conjunctive condition as
  {fieldA:valA,fieldB:valB,fieldC:valC,numClauses:3}

 then, form query from the incoming fact (event):
 fieldA:valA OR fieldB:valB OR fieldC:valC OR fieldD:valD

 to enforce overlap between condition and event, wrap the query above
 into own query whose scorer will check that numClauses for the matched doc
 is equal to number of matched clauses.
 To get numClauses for the matched doc you can use FieldCache that's
 damn fast; and number of matched clauses can be obtained from
 DisjunctionSumScorer.nrMatchers()

 Negative clauses, and multivalue can be covered also, I believe.

 WDYT?
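
To make the quoted recipe concrete, here is a rough sketch of the trivial
(conjunction-only) case: each condition is indexed with its clause count,
candidates are retrieved with a disjunction built from the incoming event, and
full satisfaction is then verified. The verification below uses a simple
post-check over stored values instead of the custom scorer / nrMatchers()
approach Mikhail describes, and all field names are illustrative:

import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public final class ConjunctionMatcher {

  /** One conjunctive condition, e.g. {site=news, geo=US}, indexed with its clause count. */
  public static Document conditionDoc(Map<String, String> clauses) {
    Document doc = new Document();
    for (Map.Entry<String, String> c : clauses.entrySet()) {
      doc.add(new StringField(c.getKey(), c.getValue(), Field.Store.YES));
    }
    doc.add(new StoredField("numClauses", clauses.size()));
    return doc;
  }

  /** Prints the conditions that are fully satisfied by the incoming event. */
  public static void match(IndexSearcher searcher, Map<String, String> event)
      throws java.io.IOException {
    BooleanQuery.Builder disjunction = new BooleanQuery.Builder();
    for (Map.Entry<String, String> a : event.entrySet()) {
      disjunction.add(new TermQuery(new Term(a.getKey(), a.getValue())),
                      BooleanClause.Occur.SHOULD);
    }
    for (ScoreDoc sd : searcher.search(disjunction.build(), 1000).scoreDocs) {
      Document hit = searcher.doc(sd.doc);
      int numClauses = hit.getField("numClauses").numericValue().intValue();
      int satisfied = 0;
      for (Map.Entry<String, String> a : event.entrySet()) {
        if (a.getValue().equals(hit.get(a.getKey()))) {
          satisfied++;   // this event attribute satisfies one clause of the condition
        }
      }
      if (satisfied == numClauses) {
        System.out.println("condition doc " + sd.doc + " matches the event");
      }
    }
  }
}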


 On Mon, Mar 5, 2012 at 10:05 PM, J. Delgado 
  joaquin.delg...@gmail.com wrote:

 I looked at LUCENE-2987 and its work on the query side (changes to the
 accepted syntax to accept lower case 'or' and 'and'), which isn't really
 related to my proposal.

 What I'm proposing is to be able to index complex boolean expressions
 using Lucene. This can be viewed as the opposite of the regular search
  task. The objective here is to find a set of relevant queries given a document
 (assignment of values to fields).

  This by itself may not sound that interesting but it's a key piece
 to efficiently implementing any MATCHING system which is effectively a
 two-way search where constraints are defined both-ways. An example of this
 would be:

  1) Job matching: Potential employers define their job postings as
  documents along with complex boolean expressions used to narrow potential
  candidates. Job searchers upload their profile and may formulate complex
  queries when executing a search. Once a search is initiated from either
  side, constraints need to be satisfied both ways.
 2) Advertising: Publishers define constraints on the type of
 advertisers/ads they are willing to show in their sites. On the other hand,
 advertisers define constraints (typically at the campaign level) on
 publisher sites they want their ads to show at as well as on the user
 audiences they are targeting to. While some attribute values are known at
 definition time, others are only instantiated once the user visits a given
 page which triggers a matching request that must be satisfied in
 few milliseconds to select valid ads and then scored based on 
 relevance.

  So in a matching system a MATCH QUERY is considered to be a tuple
  that consists of a value assignment to attributes/fields (doc) + a boolean
  expression (query) that goes against a double index also built on tuples
  that simultaneously hold boolean expressions and associated documents.

 To do this efficiently we need to be able to build indexes on Boolean
 expressions (Lucene queries) and retrieve the set of matching expressions
 given a doc (typically few attributes with values assigned), which is the
 core of what is described in this paper: Indexing Boolean Expressions
 (See http://www.vldb.org/pvldb/2/vldb09-83.pdf)

 -- J


 So to effectively resolve the problem

Re: Indexing Boolean Expressions

2013-02-11 Thread J. Delgado
Yes and no.

Most of the sponsored search and display ad systems have a distributed
matching service that selects valid ads from each of the distributed nodes
and then computes actual bids locally for each ad based on bidding models.
Only the top bids per node are sent over to the auction service.

Now, there has been some work on how to bake the bid model into a relevance
score to be able to select top N from the index, however it is not as
simple as it seems as some of these prediction/bidding models
use sophisticated machine learning algorithms based on training samples and
somewhat complex objective functions.

-- Joaquin


On Mon, Feb 11, 2013 at 10:02 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Yeah except with Percolator (which uses MemoryIndex under the hood, I
 believe), there is no relevancy score, so you just get match/no match, and
 if you need to figure out top N best matching ads against queries derived
 from page content, you need that relevancy score to get not all matching
 docs, but just those top N.

 No?

 Otis
 --
 http://sematext.com/





  On Mon, Feb 11, 2013 at 11:22 AM, J. Delgado joaquin.delg...@gmail.com wrote:

  I guess ElasticSearch went ahead of Solr with the percolate API, which is
  exactly what is needed for the two-way constraint+doc matching problem present
  in advertising systems and other use cases:

 http://www.elasticsearch.org/guide/reference/api/percolate.html

 Cheers,

 Joaquin Delgado, PhD.
 http://www.linkedin.com/pub/profile/0/04b/277

 On Mon, Mar 26, 2012 at 10:17 AM, Walter Underwood wun...@wunderwood.org
  wrote:

 Efficient rule matching goes further back, at least to alerting in
 Verity K2.

 wunder
 Search Guy, Chegg

 On Mar 26, 2012, at 10:15 AM, J. Delgado wrote:

 BTW, the idea of indexing Boolean Expressions inside a text indexing
 engine is not new. For example Oracle Text provides the CTXRULE index and
 the MATCHES operator within their indexing stack, which is primarily used
 for Rule-based text classification.

 See:

 http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm#autoId8

 http://docs.oracle.com/cd/B28359_01/text.111/b28303/classify.htm#g1011013

 -- J

 On Mon, Mar 26, 2012 at 10:07 AM, J. Delgado 
  joaquin.delg...@gmail.com wrote:

  In full disclosure, there is a patent application that Yahoo! has filed
  for the use of inverted indexes with complex predicates for matching
  contracts and opportunities in advertising:

  http://www.google.com/patents/US20110016109?printsec=abstract#v=onepageqf=false

  However, I believe there are many more applications that can benefit
  from similar matching techniques (e.g. recommender systems,
  e-commerce, recruiting, etc.) to make it worthwhile implementing the ideas
  exposed in the original VLDB'09 paper (which is public) in Lucene.

 As a Yahoo! employee, I might not be able to directly contribute to
 this project but will be happy to point to any publicly available pointer
 that can help.

 Cheers,

 -- Joaquin


 On Sun, Mar 25, 2012 at 11:44 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Hello Joaquin,

 I looked through the paper several times, and see no problem to
 implement it in Lucene (the trivial case at least):

 Let's index conjunctive condition as
  {fieldA:valA,fieldB:valB,fieldC:valC,numClauses:3}

 then, form query from the incoming fact (event):
 fieldA:valA OR fieldB:valB OR fieldC:valC OR fieldD:valD

 to enforce overlap between condition and event, wrap the query above
 into own query whose scorer will check that numClauses for the matched doc
 is equal to number of matched clauses.
 To get numClauses for the matched doc you can use FieldCache that's
 damn fast; and number of matched clauses can be obtained from
 DisjunctionSumScorer.nrMatchers()

 Negative clauses, and multivalue can be covered also, I believe.

 WDYT?


 On Mon, Mar 5, 2012 at 10:05 PM, J. Delgado joaquin.delg...@gmail.com
  wrote:

 I looked at LUCENE-2987 and its work on the query side (changes to
 the accepted syntax to accept lower case 'or' and 'and'), which isn't
 really related to my proposal.

 What I'm proposing is to be able to index complex boolean expressions
 using Lucene. This can be viewed as the opposite of the regular search
 task. The objective here is find a set of relevant queries given a 
 document
 (assignment of values to fields).

 This by itself may not sound that interesting but its a key piece
 to efficiently implementing any MATCHING system which is effectively a
 two-way search where constraints are defined both-ways. An example of 
 this
 would be:

 1) Job matching: Potential employers define their job posting as a
 documents along with complex boolean expressions used to narrow potential
 candidates. Job searchers upload their profile and may formulate 
 complex
 queries when executing a search. Once a is search initiated from any of 
 the
 sides constraints need to satisfied both ways.
 2) Advertising: Publishers define constraints on the type

Re: Indexing Boolean Expressions

2012-03-26 Thread J. Delgado
In full disclosure, there is a patent application that Yahoo! has filed for
the use of inverted indexes with complex predicates for matching
contracts and opportunities in advertising:
http://www.google.com/patents/US20110016109?printsec=abstract#v=onepageqf=false

However, I believe there are many more applications that can benefit from
similar matching techniques (e.g. recommender systems,
e-commerce, recruiting, etc.) to make it worthwhile implementing the ideas
exposed in the original VLDB'09 paper (which is public) in Lucene.

As a Yahoo! employee, I might not be able to directly contribute to this
project but will be happy to point to any publicly available pointer that
can help.

Cheers,

-- Joaquin


On Sun, Mar 25, 2012 at 11:44 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Hello Joaquin,

 I looked through the paper several times, and see no problem to implement
 it in Lucene (the trivial case at least):

 Let's index conjunctive condition as
  {fieldA:valA,fieldB:valB,fieldC:valC,numClauses:3}

 then, form query from the incoming fact (event):
 fieldA:valA OR fieldB:valB OR fieldC:valC OR fieldD:valD

 to enforce overlap between condition and event, wrap the query above into
 own query whose scorer will check that numClauses for the matched doc is
 equal to number of matched clauses.
 To get numClauses for the matched doc you can use FieldCache that's damn
 fast; and number of matched clauses can be obtained from
 DisjunctionSumScorer.nrMatchers()

 Negative clauses, and multivalue can be covered also, I believe.

 WDYT?


  On Mon, Mar 5, 2012 at 10:05 PM, J. Delgado joaquin.delg...@gmail.com wrote:

 I looked at LUCENE-2987 and its work on the query side (changes to the
 accepted syntax to accept lower case 'or' and 'and'), which isn't really
 related to my proposal.

 What I'm proposing is to be able to index complex boolean expressions
 using Lucene. This can be viewed as the opposite of the regular search
  task. The objective here is to find a set of relevant queries given a document
 (assignment of values to fields).

  This by itself may not sound that interesting but it's a key piece
 to efficiently implementing any MATCHING system which is effectively a
 two-way search where constraints are defined both-ways. An example of this
 would be:

  1) Job matching: Potential employers define their job postings as
  documents along with complex boolean expressions used to narrow potential
  candidates. Job searchers upload their profile and may formulate complex
  queries when executing a search. Once a search is initiated from either
  side, constraints need to be satisfied both ways.
 2) Advertising: Publishers define constraints on the type of
 advertisers/ads they are willing to show in their sites. On the other hand,
 advertisers define constraints (typically at the campaign level) on
 publisher sites they want their ads to show at as well as on the user
 audiences they are targeting to. While some attribute values are known at
 definition time, others are only instantiated once the user visits a given
 page which triggers a matching request that must be satisfied in
 few milliseconds to select valid ads and then scored based on relevance.

  So in a matching system a MATCH QUERY is considered to be a tuple that
  consists of a value assignment to attributes/fields (doc) + a boolean
  expression (query) that goes against a double index also built on tuples
  that simultaneously hold boolean expressions and associated documents.

 To do this efficiently we need to be able to build indexes on Boolean
 expressions (Lucene queries) and retrieve the set of matching expressions
 given a doc (typically few attributes with values assigned), which is the
 core of what is described in this paper: Indexing Boolean Expressions
 (See http://www.vldb.org/pvldb/2/vldb09-83.pdf)

 -- J


 So to effectively resolve the problem of realtime matching one can

  On Tue, Feb 21, 2012 at 2:18 PM, Joe Cabrera calcmaste...@gmail.com wrote:

  On 02/21/2012 12:15 PM, Aayush Kothari wrote:




  So if Aayush Kothari is interested in working on this as a Student,
 all we need is a formal mentor (I can be the informal one).

  Anyone up for the task?


   Completely interested in working for and learning about the
 aforementioned subject/project. +1.

  This may be related to the work I'm doing with LUCENE-2987.
  Basically changing the grammar to accept conjunctions AND and OR in the
  query text.
 I would be interested in working with you on some of the details.

 However, I too am not a formal committer.

 --
  Joe Cabrera, eminorlabs.com





 --
 Sincerely yours
 Mikhail Khludnev
 Lucid Certified
 Apache Lucene/Solr Developer
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com




Re: Indexing Boolean Expressions

2012-03-26 Thread J. Delgado
BTW, the idea of indexing Boolean Expressions inside a text indexing engine
is not new. For example Oracle Text provides the CTXRULE index and the
MATCHES operator within their indexing stack, which is primarily used for
Rule-based text classification.

See:

http://docs.oracle.com/cd/B28359_01/text.111/b28303/query.htm#autoId8

http://docs.oracle.com/cd/B28359_01/text.111/b28303/classify.htm#g1011013

-- J

On Mon, Mar 26, 2012 at 10:07 AM, J. Delgado joaquin.delg...@gmail.com wrote:

 In full dislosure, there is a patent application that Yahoo! has filed for
 the use of inverted indexes for using complex  predicates for matching
 contracts and opportunities in advertising:

 http://www.google.com/patents/US20110016109?printsec=abstract#v=onepageqf=false

 However, I believe there are many more applications that can benefit from
 similar matching techniques (e.g. recommender systems,
 e-commerce, recruiting, etc.), making it worthwhile to implement the ideas
 exposed in the original VLDB'09 paper (which is public) in Lucene.

 As a Yahoo! employee, I might not be able to directly contribute to this
 project, but I will be happy to point to any publicly available material that
 can help.

 Cheers,

 -- Joaquin


 On Sun, Mar 25, 2012 at 11:44 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

 Hello Joaquin,

 I looked through the paper several times, and I see no problem implementing
 it in Lucene (the trivial case at least):

 Let's index a conjunctive condition as
  {fieldA:valA, fieldB:valB, fieldC:valC, numClauses:3}

 then form a query from the incoming fact (event):
 fieldA:valA OR fieldB:valB OR fieldC:valC OR fieldD:valD

 To enforce overlap between condition and event, wrap the query above in our
 own query whose scorer checks that numClauses for the matched doc is
 equal to the number of matched clauses.
 To get numClauses for the matched doc you can use FieldCache, which is
 damn fast; the number of matched clauses can be obtained from
 DisjunctionSumScorer.nrMatchers().

 Negative clauses and multi-valued fields can be covered as well, I believe.

 WDYT?
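A rough sketch of that trivial case, using 3.x-era Lucene APIs (only the index-side and
query-side construction is shown; the exact-overlap check is outlined in comments, and the
wrapper query around the disjunction would still need to be written):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

import java.util.Map;

public class ConditionIndexingSketch {

  // Index one conjunctive condition, e.g. {fieldA:valA, fieldB:valB, fieldC:valC}.
  static Document conditionAsDocument(Map<String, String> clauses) {
    Document doc = new Document();
    for (Map.Entry<String, String> clause : clauses.entrySet()) {
      doc.add(new Field(clause.getKey(), clause.getValue(),
                        Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
    // numClauses is read back at search time (e.g. via FieldCache) so a custom
    // scorer can require that *all* clauses of the condition matched the event.
    doc.add(new Field("numClauses", Integer.toString(clauses.size()),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
  }

  // Turn an incoming fact/event (attribute assignment) into a disjunctive query.
  static BooleanQuery eventAsQuery(Map<String, String> event) {
    BooleanQuery query = new BooleanQuery();
    for (Map.Entry<String, String> attr : event.entrySet()) {
      query.add(new TermQuery(new Term(attr.getKey(), attr.getValue())),
                BooleanClause.Occur.SHOULD);
    }
    // To enforce full overlap, wrap this disjunction in a query whose scorer only
    // accepts a doc when the number of matched SHOULD clauses
    // (DisjunctionSumScorer.nrMatchers()) equals the doc's stored numClauses.
    return query;
  }
}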


 On Mon, Mar 5, 2012 at 10:05 PM, J. Delgado joaquin.delg...@gmail.comwrote:

 I looked at LUCENE-2987 and its work on the query side (changes to the
 accepted syntax to accept lower case 'or' and 'and'), which isn't really
 related to my proposal.

 What I'm proposing is to be able to index complex boolean expressions
 using Lucene. This can be viewed as the opposite of the regular search
 task. The objective here is to find a set of relevant queries given a document
 (an assignment of values to fields).

 This by itself may not sound that interesting, but it's a key piece
 to efficiently implementing any MATCHING system, which is effectively a
 two-way search where constraints are defined both ways. Examples of this
 would be:

 1) Job matching: Potential employers define their job postings as
 documents along with complex boolean expressions used to narrow potential
 candidates. Job searchers upload their profiles and may formulate complex
 queries when executing a search. Once a search is initiated from either
 side, constraints need to be satisfied both ways.
 2) Advertising: Publishers define constraints on the type of
 advertisers/ads they are willing to show on their sites. On the other hand,
 advertisers define constraints (typically at the campaign level) on the
 publisher sites where they want their ads to show, as well as on the user
 audiences they are targeting. While some attribute values are known at
 definition time, others are only instantiated once the user visits a given
 page, which triggers a matching request that must be satisfied in a
 few milliseconds to select valid ads, which are then scored based on relevance.

 So in a matching system a MATCH QUERY is considered to be a tuple
 that consists of a value assignment to attributes/fields (doc) + a boolean
 expression (query) that goes against a double index also built on tuples
 that simultaneously hold boolean expressions and associated documents.

 To do this efficiently we need to be able to build indexes on Boolean
 expressions (Lucene queries) and retrieve the set of matching expressions
 given a doc (typically a few attributes with values assigned), which is the
 core of what is described in this paper: Indexing Boolean Expressions
 (see http://www.vldb.org/pvldb/2/vldb09-83.pdf).

 -- J


 So to effectively resolve the problem of realtime matching one can

 On Tue, Feb 21, 2012 at 2:18 PM, Joe Cabrera calcmaste...@gmail.comwrote:

  On 02/21/2012 12:15 PM, Aayush Kothari wrote:




  So if Aayush Kothari is interested in working on this as a Student,
 all we need is a formal mentor (I can be the informal one).

  Anyone up for the task?


   Completely interested in working for and learning about the
 aforementioned subject/project. +1.

 This may be related to the work I'm doing with LUCENE-2987
 Basically changing the grammar to accept the conjunctions AND and OR in
 the query text.
 I would be interested

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread J. Delgado
Mark, can you share more about which K-V (NoSQL) stores you've been
benchmarking and what the results have been?

Did you try all the well known ones?
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

-- J

On Thu, Mar 22, 2012 at 10:42 AM, mark harwood markharw...@yahoo.co.ukwrote:

 I've been spending quite a bit of time recently benchmarking various
 Key-Value stores for a demanding project and have been largely disappointed
 with the results.
 However, I have developed a promising implementation based on these
 concepts:  http://www.slideshare.net/MarkHarwood/lucene-kvstore

 The code needs some packaging before I can release it but the slide deck
 should give a good overview of the design.


 Is this something that is likely to be of interest as a contrib module
 here?
 I appreciate this is a departure from the regular search focus but it
 builds on some common ground in Lucene core and may have some applications
 here.

 Cheers,
 Mark


 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Indexing Boolean Expressions

2012-03-05 Thread J. Delgado
I looked at LUCENE-2987 and its work on the query side (changes to the
accepted syntax to accept lower case 'or' and 'and'), which isn't really
related to my proposal.

What I'm proposing is to be able to index complex boolean expressions using
Lucene. This can be viewed as the opposite of the regular search task. The
objective here is to find a set of relevant queries given a document
(an assignment of values to fields).

This by itself may not sound that interesting, but it's a key piece
to efficiently implementing any MATCHING system, which is effectively a
two-way search where constraints are defined both ways. Examples of this
would be:

1) Job matching: Potential employers define their job postings as
documents along with complex boolean expressions used to narrow potential
candidates. Job searchers upload their profiles and may formulate complex
queries when executing a search. Once a search is initiated from either
side, constraints need to be satisfied both ways.
2) Advertising: Publishers define constraints on the type of
advertisers/ads they are willing to show on their sites. On the other hand,
advertisers define constraints (typically at the campaign level) on the
publisher sites where they want their ads to show, as well as on the user
audiences they are targeting. While some attribute values are known at
definition time, others are only instantiated once the user visits a given
page, which triggers a matching request that must be satisfied in a
few milliseconds to select valid ads, which are then scored based on relevance.

So in a matching system a MATCH QUERY is considered to be a tuple that
consists of a value assignment to attributes/fields (doc) + a boolean
expression (query) that goes against a double index also built on tuples
that simultaneously hold boolean expressions and associated documents.

To do this efficiently we need to be able to build indexes on Boolean
expressions (Lucene queries) and retrieve the set of matching expressions
given a doc (typically a few attributes with values assigned), which is the
core of what is described in this paper: Indexing Boolean Expressions
(see http://www.vldb.org/pvldb/2/vldb09-83.pdf).

-- J


So to effectively resolve the problem of realtime matching one can

On Tue, Feb 21, 2012 at 2:18 PM, Joe Cabrera calcmaste...@gmail.com wrote:

  On 02/21/2012 12:15 PM, Aayush Kothari wrote:




  So if Aayush Kothari is interested in working on this as a Student, all
 we need is a formal mentor (I can be the informal one).

  Anyone up for the task?


   Completely interested in working for and learning about the
 aforementioned subject/project. +1.

 This may be related to the work I'm doing with LUCENE-2987
 Basically changing the grammar to accepts conjunctions AND and OR in the
 query text.
 I would be interested in working with you on some of the details.

 However, I too am not a formal committer.

 --
 Joe Cabreraeminorlabs.com




Indexing Boolean Expressions

2012-02-21 Thread J. Delgado
Hi,

I would like to propose implementing Indexing Boolean Expressions (See
http://www.vldb.org/pvldb/2/vldb09-83.pdf) as a Lucene-based project for
GSoC.

Here is a snippet from the Abstract of the paper:
We consider the problem of efficiently indexing Disjunctive Normal Form
(DNF) and Conjunctive Normal Form (CNF) Boolean expressions over a
high-dimensional multi-valued attribute space. The goal is to rapidly find
the set of Boolean expressions that evaluate to true for a given assignment
of values to attributes. A solution to this problem has applications in
online advertising (where a Boolean expression represents an advertiser’s
user targeting requirements, and an assignment of values to attributes
represents the characteristics of a user visiting an online page) and in
general any publish/subscribe system (where a Boolean expression
represents a subscription, and an assignment of values to attributes
represents an event).

Any interest?

-- J


Re: Indexing Boolean Expressions

2012-02-21 Thread J. Delgado
According to http://community.apache.org/mentoringprogramme.html I'm not
allowed to be a Mentor, because I'm not a committer. However, I believe
this can be a really interesting (and useful) project as it has a variety
of applications, including advertising, recommender systems, matching
engines, information filtering, pub-sub systems, etc.

Here is an interesting quote off the paper:

IR systems [21, 26], which efficiently search documents given a
query, have been heavily studied. Our application is different in that
we are searching for queries (BEs) given the data (instead of the
other way around), and that we exploit the syntax of the complex
queries in order to exactly find the satisfied BEs

So if Aayush Kothari is interested in working on this as a Student, all we
need is a formal mentor (I can be the informal one).

Anyone up for the task?

-- J

On Tue, Feb 21, 2012 at 8:28 AM, Aayush Kothari
aayush.kothar...@gmail.comwrote:

 That's a really nice application of DNF and CNF. I'd be happy to work at
 it if it gets approved in GSoC.


 On 21 February 2012 14:09, J. Delgado joaquin.delg...@gmail.com wrote:

 Hi,

 I would like to propose implementing Indexing Boolean Expressions (See
 http://www.vldb.org/pvldb/2/vldb09-83.pdf) as a Lucene-based project for
 GSoC.

 Here is a snippet from the Abstract of the paper:
 We consider the problem of efficiently indexing Disjunctive Normal Form
 (DNF) and Conjunctive Normal Form (CNF) Boolean expressions over a
 high-dimensional multi-valued attribute space. The goal is to rapidly find
 the set of Boolean expressions that evaluate to true for a given assignment
 of values to attributes. A solution to this problem has applications in
 online advertising (where a Boolean expression represents an advertiser’s
 user targeting requirements, and an assignment of values to attributes
 represents the characteristics of a user visiting an online page) and in
 general any publish/subscribe system (where a Boolean expression
 represents a subscription, and an assignment of values to attributes
 represents an event).

 Any interest?

 -- J





Re: Adding another dimension to Lucene searches

2010-05-10 Thread J. Delgado
Hierarchical documents are a key concept towards a unified
structured+unstructured search. They should allow us to fully implement
things such as XQuery + Full-Text
(http://www.w3.org/TR/xquery-full-text/).

Additionally, it solves a century-old problem: how to deal with
sections/sub-sections in very large documents. A long time ago I was
indexing text books (in PDF) and had to break each book down into pages
and store the main doc id in a field as a pointer to maintain the
relation.

Mark, way to go!

-- Joaquin

On Mon, May 10, 2010 at 8:03 AM, Grant Ingersoll gsing...@apache.org wrote:
 Very cool stuff, Mark.

 Can you just open a JIRA and attach there?

 On May 10, 2010, at 8:38 AM, mark harwood wrote:

 I've put up code, example data and tests for the Nested Document feature 
 here: http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip

 The data used in the unit tests is chosen to illustrate practical use of 
 real-world content.
 The final unit tests will work on more abstract data for more 
 formal/exhaustive testing of functionality.

 This packaging changes no existing Lucene code and is bundled with 3.0.1 but 
 should work with 2.9.1. The readme.txt highlights the issues with segment 
 flushing that may need addressing before adoption.


 Cheers
 Mark





 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
Here is the link to the paper.
http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf

A more recent application of the use and extension of the WAND operator for
indexing of Boolean expressions:
http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

-- Joaquin

On Sun, Nov 15, 2009 at 11:15 PM, Uwe Schindler u...@thetaphi.de wrote:

  I see the attachment... (in java-dev)



 Uwe



 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
   --

 *From:* Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
 *Sent:* Monday, November 16, 2009 8:13 AM
 *To:* solr-...@lucene.apache.org
 *Cc:* java-dev@lucene.apache.org
 *Subject:* Re: Efficient Query Evaluation using a Two-Level Retrieval
 Process



 Hey Joaquin,



 The mailing list strips off attachments. Can you please upload it somewhere
 and give us the link?

 On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado joaquin.delg...@gmail.com
 wrote:

 Please find attached the paper on Efficient Query Evaluation using a
 Two-Level Retrieval Process. I believe that such approach may improve the
 way Lucene/Solr evaluates queries today.

 Cheers,

 -- Joaquin




 --
 Regards,
 Shalin Shekhar Mangar.



Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
As I understand it, setMinimumNumberShouldMatch(int min) is used to
specify a minimum number of the optional BooleanClauses which must be
satisfied.

I haven't seen the implementation of setMinimumNumberShouldMatch, but
it seems a bit different from what is intended with the WAND operator,
which can take any real number as the threshold θ.

As stated in the paper:

WAND(X1, w1, . . ., Xk, wk, θ) is true iff SUM_{1≤i≤k} xi·wi ≥ θ

where xi is the indicator variable for Xi, that is, xi = 1 if Xi is
true and 0 otherwise.

Observe that WAND can be used to implement AND
and OR via
AND(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, k)
and
OR(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, 1).

What I find interesting is the idea of a first pass that uses the
upper bound (maximal) contribution of a term to any document's score,
together with a dynamically set threshold θ, to decide whether to skip
or to fully evaluate a document.

As stated in the paper:

Given this setup our preliminary scoring consists of evaluating
for each document d
WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
where Xi is an indicator variable for the presence of query term i in
document d and the threshold θ is varied during
the algorithm as explained below. If WAND evaluates to true, then the
document d undergoes a full evaluation.
The threshold θ is set dynamically by the algorithm based on the
minimum score m among the top n results found so
far, where n is the number of requested documents. The larger the
threshold, the more documents will be skipped
and thus we will need to compute full scores for fewer documents.

I think it's worth a try...
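To make the control flow of that two-level pass concrete, here is a minimal,
illustrative sketch (the TermPostings and FullScorer interfaces are hypothetical
placeholders, not Lucene's actual scorer APIs):

import java.util.List;
import java.util.PriorityQueue;

class WandFirstPassSketch {

  // Hypothetical view of a query term's postings.
  interface TermPostings {
    boolean matches(int docId);   // does term i occur in doc?
    float upperBound();           // UBi: maximal contribution of term i to any score
  }

  // Hypothetical full scorer, only invoked when WAND evaluates to true.
  interface FullScorer {
    float score(int docId);
  }

  static PriorityQueue<Float> topN(List<TermPostings> terms,
                                   Iterable<Integer> candidateDocs,
                                   int n, FullScorer fullScorer) {
    PriorityQueue<Float> heap = new PriorityQueue<Float>(); // min-heap of the top-n scores
    for (int doc : candidateDocs) {
      // θ = minimum score m among the top-n results found so far (0 until the heap is full).
      float theta = heap.size() < n ? 0f : heap.peek();
      float optimistic = 0f;
      for (TermPostings t : terms) {
        if (t.matches(doc)) optimistic += t.upperBound();   // sum of UBi over matching terms
      }
      if (optimistic < theta) continue;                     // WAND false: skip full evaluation
      float score = fullScorer.score(doc);                  // full evaluation only here
      heap.offer(score);
      if (heap.size() > n) heap.poll();                     // keep only the n best
    }
    return heap;
  }
}

The larger the threshold grows as the heap fills, the more candidate documents
are skipped before full scoring, which is exactly the effect described in the
quoted paragraph.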

-- Joaquin

On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki a...@getopt.org wrote:

 J. Delgado wrote:

 Here is the link to the paper.
 http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf

 A more recent application of the use and extension of the WAND operator for
 indexing of Boolean expressions:
 http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

 -- Joaquin


 On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 Hey Joaquin,

 The mailing list strips off attachments. Can you please upload it somewhere
 and give us the link?

 On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado joaquin.delg...@gmail.com

 wrote:
 Please find attached the paper on Efficient Query Evaluation using a
 Two-Level Retrieval Process. I believe that such approach may improve

 the

 way Lucene/Solr evaluates queries today.

 The functionality of WAND (weak AND) is already implemented in Lucene, if I 
 understand it correctly - this is the BooleanQuery.setMinShouldMatch(int). 
 Lucene implements this probably differently from the algorithm described in 
 the paper, so there may be still some benefits from comparing the algorithms 
 in Lucene's BooleanScorer[2] with this one ...


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
On Mon, Nov 16, 2009 at 9:44 AM, Earwin Burrfoot ear...@gmail.com wrote:
 This algo is strictly tied to sort-by-score, if I understand it correctly.
 Lucene has queries and sorting decoupled (except for allowOutOfOrder
 mess), so implementing it would require some really fat hacks.


According to the paper on Indexing Boolean Expressions (using the WAND
algo), sorting can be done based on scores that are determined by a
weight assignment to key-value pairs:

http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

So I believe this can be generalized to sorting by any doc attributes,
given the proper weight assignment model.

Of course, the devil is in the details :-(

-- Joaquin


 On Mon, Nov 16, 2009 at 20:26, J. Delgado joaquin.delg...@gmail.com wrote:
 As I understood it setMinimumNumberShouldMatch(int min) Is used to
 specify a minimum number of the optional BooleanClauses which must be
 satisfied.

 I haven't seen the implementation of setMinimumNumberShouldMatch but
 it seems a bit different than what is intended with the WAND operator,
 which can take any real number as threshold θ

 As stated in the paper:

 WAND(X1, w1, . . ., Xk, wk, θ) is true iff SUM_{1≤i≤k} xi·wi ≥ θ

 where xi is the indicator variable for Xi, that is, xi = 1 if Xi is
 true and 0 otherwise.

 Observe that WAND can be used to implement AND
 and OR via
 AND(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, k)
 and
 OR(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, 1).

 What I find interesting is the idea of using a first pass using the
 upper bound (maximal) contribution of a term on any document score and
 the dynamic setting of the threshold θ to skip or to fully evaluate a
 document..

 As stated in the paper:

 Given this setup our preliminary scoring consists of evaluating
 for each document d
 WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
 where Xi is an indicator variable for the presence of query term i in
 document d and the threshold θ is varied during
 the algorithm as explained below. If WAND evaluates to true, then the
 document d undergoes a full evaluation.
 The threshold θ is set dynamically by the algorithm based on the
 minimum score m among the top n results found so
 far, where n is the number of requested documents. The larger the
 threshold, the more documents will be skipped
 and thus we will need to compute full scores for fewer documents.

 I think its worth a try...

 -- Joaquin

 On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki a...@getopt.org wrote:

 J. Delgado wrote:

 Here is the link to the paper.
 http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf

 A more recent application of the use and extension of the WAND operator for
 indexing of Boolean expressions:
 http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

 -- Joaquin


 On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 Hey Joaquin,

 The mailing list strips off attachments. Can you please upload it 
 somewhere
 and give us the link?

 On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado joaquin.delg...@gmail.com

 wrote:
 Please find attached the paper on Efficient Query Evaluation using a
 Two-Level Retrieval Process. I believe that such approach may improve

 the

 way Lucene/Solr evaluates queries today.

 The functionality of WAND (weak AND) is already implemented in Lucene, if I 
 understand it correctly - this is the BooleanQuery.setMinShouldMatch(int). 
 Lucene implements this probably differently from the algorithm described in 
 the paper, so there may be still some benefits from comparing the 
 algorithms in Lucene's BooleanScorer[2] with this one ...


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





 --
 Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
 Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
 ICQ: 104465785

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
Here is the link to the paper.
http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf

A more recent application of the use and extension of the WAND operator for
indexing of Boolean expressions:
http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

-- Joaquin


On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Hey Joaquin,

 The mailing list strips off attachments. Can you please upload it somewhere
 and give us the link?

 On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado joaquin.delg...@gmail.com
 wrote:

  Please find attached the paper on Efficient Query Evaluation using a
  Two-Level Retrieval Process. I believe that such approach may improve
 the
  way Lucene/Solr evaluates queries today.
 
  Cheers,
 
  -- Joaquin
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
As I understand it, setMinimumNumberShouldMatch(int min) is used to
specify a minimum number of the optional BooleanClauses which must be
satisfied.

I haven't seen the implementation of setMinimumNumberShouldMatch, but
it seems a bit different from what is intended with the WAND operator,
which can take any real number as the threshold θ.

As stated in the paper:

WAND(X1, w1, . . ., Xk, wk, θ) is true iff SUM_{1≤i≤k} xi·wi ≥ θ

where xi is the indicator variable for Xi, that is, xi = 1 if Xi is
true and 0 otherwise.

Observe that WAND can be used to implement AND
and OR via
AND(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, k)
and
OR(X1, X2, . . ., Xk) ≡ WAND(X1, 1, X2, 1, . . ., Xk, 1, 1).

What I find interesting is the idea of a first pass that uses the
upper bound (maximal) contribution of a term to any document's score,
together with a dynamically set threshold θ, to decide whether to skip
or to fully evaluate a document.

As stated in the paper:

Given this setup our preliminary scoring consists of evaluating
for each document d
WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
where Xi is an indicator variable for the presence of query term i in
document d and the threshold θ is varied during
the algorithm as explained below. If WAND evaluates to true, then the
document d undergoes a full evaluation.
The threshold θ is set dynamically by the algorithm based on the
minimum score m among the top n results found so
far, where n is the number of requested documents. The larger the
threshold, the more documents will be skipped
and thus we will need to compute full scores for fewer documents.

I think it's worth a try...

-- Joaquin

On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki a...@getopt.org wrote:

 J. Delgado wrote:

 Here is the link to the paper.
 http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf

 A more recent application of the use and extension of the WAND operator for
 indexing of Boolean expressions:
 http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

 -- Joaquin


 On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

 Hey Joaquin,

 The mailing list strips off attachments. Can you please upload it somewhere
 and give us the link?

 On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado joaquin.delg...@gmail.com

 wrote:
 Please find attached the paper on Efficient Query Evaluation using a
 Two-Level Retrieval Process. I believe that such approach may improve

 the

 way Lucene/Solr evaluates queries today.

 The functionality of WAND (weak AND) is already implemented in Lucene, if I 
 understand it correctly - this is the BooleanQuery.setMinShouldMatch(int). 
 Lucene implements this probably differently from the algorithm described in 
 the paper, so there may be still some benefits from comparing the algorithms 
 in Lucene's BooleanScorer[2] with this one ...


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-15 Thread J. Delgado
Please find attached the paper on Efficient Query Evaluation using a
Two-Level Retrieval Process. I believe that such an approach may improve the
way Lucene/Solr evaluates queries today.

Cheers,

-- Joaquin


Re: Grouping Lucene search results and calculating frequency by category

2009-04-11 Thread J. Delgado
Have you looked at SOLR?
http://lucene.apache.org/solr/

It pretty much has what you are looking for.
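For reference, the kind of counting Solr's field faceting performs can also be
sketched at the Lucene level with FieldCache instead of per-hit term vectors
(2.4-era HitCollector API; the untokenized "cityState" field name is just a
placeholder, not something from the original post):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class CityStateCountingCollector extends HitCollector {
  private final String[] cityStateByDoc;                 // one value per doc, cached once
  final Map<String, Integer> counts = new HashMap<String, Integer>();

  CityStateCountingCollector(IndexReader reader) throws IOException {
    this.cityStateByDoc = FieldCache.DEFAULT.getStrings(reader, "cityState");
  }

  public void collect(int doc, float score) {
    String key = cityStateByDoc[doc];
    if (key == null) return;                             // doc has no cityState value
    Integer current = counts.get(key);
    counts.put(key, current == null ? 1 : current + 1);
  }
}

Running searcher.search(query, new CityStateCountingCollector(reader)) would then
leave counts with one entry per unique City,State value, without touching term
vectors on the hot path.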

-- Joaquin

On Fri, Apr 10, 2009 at 9:39 PM, mitu2009 musicfrea...@gmail.com wrote:


 Am working on a store search API using Lucene.

 I need to show store search results for each City,State combination with
 its frequency in brackets, for example:

 Los Angeles, CA (450) Atlanta, GA (212) Boston, MA (78) . . .

 As of now, my search results return around 7000 lucene documents on an
 average if the user says Show me all the stores. In this use case, I end
 up showing around 800 unique City,State records as shown above.

 Am overriding HitCollector class's Collect method and retrieving vectors as
 follows: var vectors = _reader.GetTermFreqVectors(doc); Then I iterate
 through this collection and calculate the frequency for each unique
 City,State combination.

 But this is turning out to be very very slow in performance...is there any
 better way of grouping search results and calculating frequency in Lucene?
 Code snippet would be very helpful

 Also,please suggest me if i can optimize my Lucene search code using any
 other techniques/tips

 Thanks for reading!

 --
 View this message in context:
 http://www.nabble.com/Grouping-Lucene-search-results-and-calculating-frequency-by-category-tp22997958p22997958.html
 Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Realtime Search

2008-12-26 Thread J. Delgado
The addition of docs into tiny segments using the current data structures
seems the right way to go. Some time back one of my engineers implemented
pseudo real-time search using MultiSearcher by having an in-memory (RAM-based)
short-term index that auto-merged into a disk-based long-term index that
eventually got merged into archive indexes. Index optimization would take
place during these merges. The search we required was very time-sensitive
(searching last-minute breaking news wires). The advantage of having an
archive index is that very old documents in our applications were not
usually searched unless archives were explicitly selected.
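A stripped-down sketch of that layout, using roughly Lucene 2.9-era APIs
(MultiSearcher, RAMDirectory); the path, analyzer, placeholder document and merge
timing are all illustrative assumptions, not the original system:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

import java.io.File;

public class PseudoRealtimeSketch {
  public static void main(String[] args) throws Exception {
    Directory shortTerm = new RAMDirectory();                         // fresh docs land here
    Directory longTerm = FSDirectory.open(new File("/tmp/long-term-index"));

    IndexWriter ramWriter = new IndexWriter(shortTerm,
        new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    IndexWriter diskWriter = new IndexWriter(longTerm,
        new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    diskWriter.commit();                                              // ensure the long-term index exists

    // Indexer side: new documents go to the small RAM index, so a bulk add is cheap.
    Document doc = new Document();
    doc.add(new Field("body", "last-minute breaking news wire",
                      Field.Store.NO, Field.Index.ANALYZED));
    ramWriter.addDocument(doc);
    ramWriter.commit();

    // Search side: fan out over both indexes, so recent docs are immediately visible.
    MultiSearcher searcher = new MultiSearcher(new Searchable[] {
        new IndexSearcher(shortTerm), new IndexSearcher(longTerm) });

    // Periodically (off the search path) the RAM index is folded into the
    // disk-based long-term index and a fresh RAM index is started.
    ramWriter.close();
    diskWriter.addIndexes(new Directory[] { shortTerm });
    diskWriter.close();
    searcher.close();
  }
}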

-- Joaquin

On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote:

 Michael McCandless wrote:

 So then I think we should start with approach #2 (build real-time on
 top of the Lucene core) and iterate from there.  Newly added docs go
 into a tiny segments, which IndexReader.reopen pulls in.  Replaced or
 deleted docs record the delete against the right SegmentReader (and
 LUCENE-1314 lets reopen carry those pending deletes forward, in RAM).

 I would take the simple approach first: use ordinary SegmentReader on
 a RAMDirectory for the tiny segments.  If that proves too slow, swap
 in Memory/InstantiatedIndex for the tiny segments.  If that proves too
 slow, build a reader impl that reads from DocumentsWriter RAM buffer.


 +1 This sounds like a good approach to me.  I don't see any fundamental
 reasons why we need different representations, and fewer implementations of
 IndexWriter and IndexReader is generally better, unless they get way too
 hairy.  Mostly it seems that real-time can be done with our existing toolbox
 of datastructures, but with some slightly different control structures.
  Once we have the control structure in place then we should look at
 optimizing data structures as needed.

 Doug


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: Realtime Search

2008-12-26 Thread J. Delgado
One thing that I forgot to mention is that in our implementation the
real-time indexing took place with many folder-based listeners writing to
many tiny in-memory indexes partitioned by sub-source, with fewer
long-term and archive indexes per box. Overall distributed search across
the various Lucene-based search services was done using a federator component,
very much like shard-based search is done today (I believe).

-- Joaquin


On Fri, Dec 26, 2008 at 10:48 AM, J. Delgado joaquin.delg...@gmail.comwrote:

 The addition of docs into tiny segments using the current data structures
 seems the right way to go. Sometime back one of my engineers implemented
 pseudo real-time using MultiSearcher by having an in-memory (RAM based)
 short-term index that auto-merged into a disk-based long term index that
 eventually get merged into archive indexes. Index optimization would take
 place during these merges. The search we required was very time-sensitive
 (searching last-minute breaking news wires). The advantage of having an
 archive index is that very old documents in our applications were not
 usually searched on unless archives were explicitely selected.

 -- Joaquin


 On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting cutt...@apache.org wrote:

 Michael McCandless wrote:

 So then I think we should start with approach #2 (build real-time on
 top of the Lucene core) and iterate from there.  Newly added docs go
 into a tiny segments, which IndexReader.reopen pulls in.  Replaced or
 deleted docs record the delete against the right SegmentReader (and
 LUCENE-1314 lets reopen carry those pending deletes forward, in RAM).

 I would take the simple approach first: use ordinary SegmentReader on
 a RAMDirectory for the tiny segments.  If that proves too slow, swap
 in Memory/InstantiatedIndex for the tiny segments.  If that proves too
 slow, build a reader impl that reads from DocumentsWriter RAM buffer.


 +1 This sounds like a good approach to me.  I don't see any fundamental
 reasons why we need different representations, and fewer implementations of
 IndexWriter and IndexReader is generally better, unless they get way too
 hairy.  Mostly it seems that real-time can be done with our existing toolbox
 of datastructures, but with some slightly different control structures.
  Once we have the control structure in place then we should look at
 optimizing data structures as needed.

 Doug


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





Re: Ocean and GData

2008-09-27 Thread J. Delgado
On Sat, Sep 27, 2008 at 5:03 AM, Jason Rutherglen 
[EMAIL PROTECTED] wrote:

 Unlike MapReduce, there are no infrastructure whitepapers on
 how GData/Base works so I had to make a broad comparison rather than a
 specific one.

My understanding is that GBase is based on the infrastructure that Google is
building for large scale distributed computing (Google File System,
MapReduce, BigTable, GData, etc.). More specifically, it builds on BigTable,
the column-oriented storage system, which requires extremely high performance
and reliability but provides only weak guarantees on data consistency. There is
plenty of documentation on these technologies.

I agree with Otis that it makes sense to mention the characteristics of an
RDBMS that real-time search displays, such as atomicity and transactionality.

-- Joaquin


Re: Realtime Search for Social Networks Collaboration

2008-09-21 Thread J. Delgado
On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् 
[EMAIL PROTECTED] wrote:

 Moving back to RDBMS model will be a big step backwards where we miss
 mulivalued fields and arbitrary fields .


 No one is suggesting to lose any of the virtues of the field-based
indexing that Lucene provides. Quite the contrary: by extending the RDBMS
model with Lucene-based indexes one can map relational rows to documents and
columns to fields. Note that one relational field can be mapped to one or
more text-based fields, and multi-valued fields will still be allowed.

Please check the Lucene OJVM implementation for details on the implementation
and philosophy of the RDBMS-Lucene converged model:

http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

More discussion is available on Marcelo's blog; he will be presenting at Oracle
World 2008 this week.
http://marceloochoa.blogspot.com/

BTW, it just happens that this was implemented using Oracle, but a similar
implementation in H2 seems not only feasible but desirable.

-- Joaquin




 On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
 [EMAIL PROTECTED] wrote:
  Cool.  I mention H2 because it does have some Lucene code in it yes.
  Also according to some benchmarks it's the fastest of the open source
  databases.  I think it's possible to integrate realtime search for H2.
   I suppose there is no need to store the data in Lucene in this case?
  One loses the multiple values per field Lucene offers, and the schema
  become static.  Perhaps it's a trade off?
 
  On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado [EMAIL PROTECTED]
 wrote:
  Yes, both Marcelo and I would be interested.
 
  We looked into H2 and it looks like something similar to Oracle's ODCI
 can
  be implemented. Plus the primitive full-text implementación is based on
  Lucene.
  I say primitive because looking at the code I saw that one cannot define
 an
  Analyzer and for each scan corresponding to a where clause a searcher is
  open and closed, instead of having a pool, plus it does not have any way
 to
  queue changes to reduce the use of the IndexWriter, etc.
 
  But its open source and that is a great starting point!
 
  -- Joaquin
 
  On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
  [EMAIL PROTECTED] wrote:
 
  Perhaps an interesting project would be to integrate Ocean with H2
  www.h2database.com to take advantage of both models.  I'm not sure how
  exactly that would work, but it seems like it would not be too
  difficult.  Perhaps this would solve being able to perform faster
  hierarchical queries and perhaps other types of queries that Lucene is
  not capable of.
 
  Is this something Joaquin you are interested in collaborating on?  I
  am definitely interested in it.
 
  On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado [EMAIL PROTECTED]
  wrote:
   On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
   [EMAIL PROTECTED] wrote:
  
   Regarding real-time search and Solr, my feeling is the focus should
 be
   on
   first adding real-time search to Lucene, and then we'll figure out
 how
   to
   incorporate that into Solr later.
  
  
   Otis, what do you mean exactly by adding real-time search to
 Lucene?
Note
   that Lucene, being a indexing/search library (and not a full blown
   search
   engine), is by definition real-time: once you add/write a document
 to
   the
   index it becomes immediately searchable and if a document is
 logically
   deleted and no longer returned in a search, though physical deletion
   happens
   during an index optimization.
  
   Now, the problem of adding/deleting documents in bulk, as part of a
   transaction and making these documents available for search
 immediately
   after the transaction is commited sounds more like a search engine
   problem
   (i.e. SOLR, Nutch, Ocean), specially if these transactions are known
 to
   be
   I/O expensive and thus are usually implemented bached proceeses with
   some
   kind of sync mechanism, which makes them non real-time.
  
   For example, in my previous life, I designed and help implement a
   quasi-realtime enterprise search engine using Lucene, having a set of
   multi-threaded indexers hitting a set of multiple indexes alocatted
   accross
   different search services which powered a broker based distributed
   search
   interface. The most recent documents provided to the indexers were
   always
   added to the smaller in-memory (RAM) indexes which usually could
 absorbe
   the
   load of a bulk add transaction and later would be merged into
 larger
   disk
   based indexes and then flushed to make them ready to absorbe new
 fresh
   docs.
   We even had further partitioning of the indexes that reflected time
   periods
   with caps on size for them to be merged into older more archive based
   indexes which were used less (yes the search engine default search
 was
   on
   data no more than 1 month old, though user could open the time window
 by
   including archives).
  
   As for SOLR and OCEAN,  I would argue that these semi-structured

Re: Realtime Search for Social Networks Collaboration

2008-09-21 Thread J. Delgado
Sorry, I meant loose (replacing lose)

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado [EMAIL PROTECTED]wrote:

 On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् 
 [EMAIL PROTECTED] wrote:

 Moving back to RDBMS model will be a big step backwards where we miss
 mulivalued fields and arbitrary fields .


  No one is suggesting to lose any of the virtues of the field based
 indexing that Lucene provides. All but the contrary: by extending the RDBMS
 model with Lucene-based indexes one can map relational rows to documents and
 columns to fields. Note that one relational field can be mapped to one or
 more text based fields and multi-valued fields will still be allowed.

 Please check the Lucence OJVM implementation for details on implementation
 and philosophy on the RDBMS-Lucene converged model:

 http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

 More discussions at Marcelo's blog who will be presenting in Oracle World
 2008 this week.
 http://marceloochoa.blogspot.com/

 BTW, it just happen that this was implemented using Oracle but similar
 implementation in H2 seems not only feasible but desirable.

 -- Joaquin




 On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
 [EMAIL PROTECTED] wrote:
  Cool.  I mention H2 because it does have some Lucene code in it yes.
  Also according to some benchmarks it's the fastest of the open source
  databases.  I think it's possible to integrate realtime search for H2.
   I suppose there is no need to store the data in Lucene in this case?
  One loses the multiple values per field Lucene offers, and the schema
  become static.  Perhaps it's a trade off?
 
  On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado [EMAIL PROTECTED]
 wrote:
  Yes, both Marcelo and I would be interested.
 
  We looked into H2 and it looks like something similar to Oracle's ODCI
 can
  be implemented. Plus the primitive full-text implementación is based on
  Lucene.
  I say primitive because looking at the code I saw that one cannot
 define an
  Analyzer and for each scan corresponding to a where clause a searcher
 is
  open and closed, instead of having a pool, plus it does not have any
 way to
  queue changes to reduce the use of the IndexWriter, etc.
 
  But its open source and that is a great starting point!
 
  -- Joaquin
 
  On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
  [EMAIL PROTECTED] wrote:
 
  Perhaps an interesting project would be to integrate Ocean with H2
  www.h2database.com to take advantage of both models.  I'm not sure
 how
  exactly that would work, but it seems like it would not be too
  difficult.  Perhaps this would solve being able to perform faster
  hierarchical queries and perhaps other types of queries that Lucene is
  not capable of.
 
  Is this something Joaquin you are interested in collaborating on?  I
  am definitely interested in it.
 
  On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado [EMAIL PROTECTED]
 
  wrote:
   On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
   [EMAIL PROTECTED] wrote:
  
   Regarding real-time search and Solr, my feeling is the focus should
 be
   on
   first adding real-time search to Lucene, and then we'll figure out
 how
   to
   incorporate that into Solr later.
  
  
   Otis, what do you mean exactly by adding real-time search to
 Lucene?
Note
   that Lucene, being a indexing/search library (and not a full blown
   search
   engine), is by definition real-time: once you add/write a document
 to
   the
   index it becomes immediately searchable and if a document is
 logically
   deleted and no longer returned in a search, though physical deletion
   happens
   during an index optimization.
  
   Now, the problem of adding/deleting documents in bulk, as part of a
   transaction and making these documents available for search
 immediately
   after the transaction is commited sounds more like a search engine
   problem
   (i.e. SOLR, Nutch, Ocean), specially if these transactions are known
 to
   be
   I/O expensive and thus are usually implemented bached proceeses with
   some
   kind of sync mechanism, which makes them non real-time.
  
   For example, in my previous life, I designed and help implement a
   quasi-realtime enterprise search engine using Lucene, having a set
 of
   multi-threaded indexers hitting a set of multiple indexes alocatted
   accross
   different search services which powered a broker based distributed
   search
   interface. The most recent documents provided to the indexers were
   always
   added to the smaller in-memory (RAM) indexes which usually could
 absorbe
   the
   load of a bulk add transaction and later would be merged into
 larger
   disk
   based indexes and then flushed to make them ready to absorbe new
 fresh
   docs.
   We even had further partitioning of the indexes that reflected time
   periods
   with caps on size for them to be merged into older more archive
 based
   indexes which were used less (yes the search engine default search
 was
   on
   data no more than 1 month old, though user

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread J. Delgado
Yes, both Marcelo and I would be interested.

We looked into H2 and it looks like something similar to Oracle's ODCI can
be implemented. Plus, the primitive full-text implementation is based on
Lucene.
I say primitive because, looking at the code, I saw that one cannot define an
Analyzer, and for each scan corresponding to a WHERE clause a searcher is
opened and closed instead of being taken from a pool; plus it does not have
any way to queue changes to reduce the use of the IndexWriter, etc.
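Just to illustrate those two points (a reusable searcher and queued writes),
here is a toy sketch of the pattern being suggested, not H2's actual code; all
names are hypothetical:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class EmbeddedFullTextSketch {
  private final IndexWriter writer;          // single long-lived writer
  private volatile IndexSearcher searcher;   // shared, reopened only after flushes
  private final BlockingQueue<Document> pending = new LinkedBlockingQueue<Document>();

  EmbeddedFullTextSketch(IndexWriter writer, IndexSearcher searcher) {
    this.writer = writer;
    this.searcher = searcher;
  }

  // Called by the engine's DML hook; cheap, no index I/O on the caller's path.
  void enqueue(Document doc) {
    pending.add(doc);
  }

  // Called periodically (or on commit) to apply queued changes in one batch.
  void flush() throws Exception {
    List<Document> batch = new ArrayList<Document>();
    pending.drainTo(batch);
    for (Document doc : batch) {
      writer.addDocument(doc);
    }
    writer.commit();
    // Reopen the shared searcher once per batch, not once per scan.
    searcher = new IndexSearcher(writer.getDirectory(), true);
  }

  // Used by every WHERE-clause evaluation instead of opening a new searcher.
  IndexSearcher searcher() {
    return searcher;
  }
}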

But it's open source and that is a great starting point!

-- Joaquin

On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen [EMAIL PROTECTED]
 wrote:

 Perhaps an interesting project would be to integrate Ocean with H2
 www.h2database.com to take advantage of both models.  I'm not sure how
 exactly that would work, but it seems like it would not be too
 difficult.  Perhaps this would solve being able to perform faster
 hierarchical queries and perhaps other types of queries that Lucene is
 not capable of.

 Is this something Joaquin you are interested in collaborating on?  I
 am definitely interested in it.

 On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado [EMAIL PROTECTED]
 wrote:
  On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
  [EMAIL PROTECTED] wrote:
 
  Regarding real-time search and Solr, my feeling is the focus should be
 on
  first adding real-time search to Lucene, and then we'll figure out how
 to
  incorporate that into Solr later.
 
 
  Otis, what do you mean exactly by adding real-time search to Lucene?
  Note
  that Lucene, being a indexing/search library (and not a full blown search
  engine), is by definition real-time: once you add/write a document to
 the
  index it becomes immediately searchable and if a document is logically
  deleted and no longer returned in a search, though physical deletion
 happens
  during an index optimization.
 
  Now, the problem of adding/deleting documents in bulk, as part of a
  transaction and making these documents available for search immediately
  after the transaction is commited sounds more like a search engine
 problem
  (i.e. SOLR, Nutch, Ocean), specially if these transactions are known to
 be
  I/O expensive and thus are usually implemented bached proceeses with some
  kind of sync mechanism, which makes them non real-time.
 
  For example, in my previous life, I designed and help implement a
  quasi-realtime enterprise search engine using Lucene, having a set of
  multi-threaded indexers hitting a set of multiple indexes alocatted
 accross
  different search services which powered a broker based distributed search
  interface. The most recent documents provided to the indexers were always
  added to the smaller in-memory (RAM) indexes which usually could absorbe
 the
  load of a bulk add transaction and later would be merged into larger
 disk
  based indexes and then flushed to make them ready to absorbe new fresh
 docs.
  We even had further partitioning of the indexes that reflected time
 periods
  with caps on size for them to be merged into older more archive based
  indexes which were used less (yes the search engine default search was on
  data no more than 1 month old, though user could open the time window by
  including archives).
 
  As for SOLR and OCEAN,  I would argue that these semi-structured search
  engines are becomming more and more like relational databases with
 full-text
  search capablities (without the benefit of full reletional algebra -- for
  example joins are not possible using SOLR). Notice that real-time CRUD
  operations and transactionality are core DB concepts adn have been
 studied
  and developed by database communities for aquite long time. There has
 been
  recent efforts on how to effeciently integrate Lucene into releational
  databases (see Lucene JVM ORACLE integration, see
 
 http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
 )
 
  I think we should seriously look at joining efforts with open-source
  Database engine projects, written in Java (see
  http://java-source.net/open-source/database-engines) in order to blend
 IR
  and ORM for once and for all.
 
  -- Joaquin
 
 
 
  I've read Jason's Wiki as well.  Actually, I had to read it a number of
  times to understand bits and pieces of it.  I have to admit there is
 still
  some fuzziness about the whole things in my head - is Ocean something
 that
  already works, a separate project on googlecode.com?  I think so.  If
 so,
  and if you are working on getting it integrated into Lucene, would it
 make
  it less confusing to just refer to it as real-time search, so there is
 no
  confusion?
 
  If this is to be initially integrated into Lucene, why are things like
  replication, crowding/field collapsing, locallucene, name service, tag
  index, etc. all mentioned there on the Wiki and bundled with description
 of
  how real-time search works and is to be implemented?  I suppose
 mentioning
  replication kind-of makes sense because the replication

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Regarding real-time search and Solr, my feeling is the focus should be on
 first adding real-time search to Lucene, and then we'll figure out how to
 incorporate that into Solr later.


Otis, what do you mean exactly by adding real-time search to Lucene?  Note
that Lucene, being an indexing/search library (and not a full-blown search
engine), is by definition real-time: once you add/write a document to the
index it becomes immediately searchable, and if a document is deleted it is
logically deleted and no longer returned in a search, though physical deletion
happens during an index optimization.

Now, the problem of adding/deleting documents in bulk as part of a
transaction, and making these documents available for search immediately
after the transaction is committed, sounds more like a search engine problem
(i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be
I/O expensive and thus are usually implemented as batched processes with some
kind of sync mechanism, which makes them non real-time.

For example, in my previous life, I designed and helped implement a
quasi-realtime enterprise search engine using Lucene, having a set of
multi-threaded indexers hitting a set of multiple indexes allocated across
different search services which powered a broker-based distributed search
interface. The most recent documents provided to the indexers were always
added to the smaller in-memory (RAM) indexes, which usually could absorb the
load of a bulk add transaction and later would be merged into larger
disk-based indexes and then flushed to make them ready to absorb fresh docs.
We even had further partitioning of the indexes that reflected time periods,
with caps on size, for them to be merged into older, more archive-based
indexes which were used less (yes, the search engine's default search was on
data no more than 1 month old, though the user could open the time window by
including archives).

As for SOLR and OCEAN, I would argue that these semi-structured search
engines are becoming more and more like relational databases with full-text
search capabilities (without the benefit of full relational algebra -- for
example, joins are not possible using SOLR). Notice that real-time CRUD
operations and transactionality are core DB concepts and have been studied
and developed by the database community for quite a long time. There have been
recent efforts on how to efficiently integrate Lucene into relational
databases (see the Lucene JVM Oracle integration:
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
)

I think we should seriously look at joining efforts with open-source
database engine projects written in Java (see
http://java-source.net/open-source/database-engines) in order to blend IR
and ORM once and for all.

-- Joaquin





 I've read Jason's Wiki as well.  Actually, I had to read it a number of
 times to understand bits and pieces of it.  I have to admit there is still
 some fuzziness about the whole things in my head - is Ocean something that
 already works, a separate project on googlecode.com?  I think so.  If so,
 and if you are working on getting it integrated into Lucene, would it make
 it less confusing to just refer to it as real-time search, so there is no
 confusion?

 If this is to be initially integrated into Lucene, why are things like
 replication, crowding/field collapsing, locallucene, name service, tag
 index, etc. all mentioned there on the Wiki and bundled with description of
 how real-time search works and is to be implemented?  I suppose mentioning
 replication kind-of makes sense because the replication approach is closely
 tied to real-time search - all query nodes need to see index changes fast.
  But Lucene itself offers no replication mechanism, so maybe the replication
 is something to figure out separately, say on the Solr level, later on once
 we get there.  I think even just the essential real-time search requires
 substantial changes to Lucene (I remember seeing large patches in JIRA),
 which makes it hard to digest, understand, comment on, and ultimately commit
 (hence the luke warm response, I think).  Bringing other non-essential
 elements into discussion at the same time makes it more difficult to
  process all this new stuff, at least for me.  Am I the only one who finds
 this hard?

 That said, it sounds like we have some discussion going (Karl...), so I
 look forward to understanding more! :)


 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: Yonik Seeley [EMAIL PROTECTED]
  To: java-dev@lucene.apache.org
  Sent: Thursday, September 4, 2008 10:13:32 AM
  Subject: Re: Realtime Search for Social Networks Collaboration
 
  On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
  wrote:
   I also think it's got a
   lot of things now which makes integration difficult to do properly.
 
  I agree, and that's why the 

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED]wrote:

for example joins are not possible using SOLR).

 It's largely *because* Lucene doesn't do joins that it can be made to scale
 out. I've replaced two large-scale database systems this year with
 distributed Lucene solutions because this scale-out architecture provided
 significantly better performance. These were semi-structured systems too.
 Lucene's comparitively simplistic data model/query model is both a weakness
 and a strength in this regard.


 Hey, maybe the right way to go for a truly scalable, high-performance
semi-structured database is to marry HBase (Bigtable-like data storage)
with SOLR/Lucene. I concur with you in the sense that simplistic data models
coupled with high performance are the killer combination.

Let me quote this from the original Bigtable paper from Google:

 Bigtable does not support a full relational data model; instead, it
provides clients with a simple data model that supports dynamic control over
data layout and format, and allows clients to reason about the locality
properties of the data represented in the underlying storage. Data is
indexed using row and column names that can be arbitrary strings. Bigtable
also treats data as uninterpreted strings, although clients often serialize
various forms of structured and semi-structured data into these strings.
Clients can control the locality of their data through careful choices in
their schemas. Finally, Bigtable schema parameters let clients dynamically
control whether to serve data out of memory or from disk.


Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
BTW, quoting Marcelo Ochoa (the developer behind the Oracle/Lucene
implementation), the three minimal features a transactional DB should support
for Lucene integration are:

  1) The ability to define new functions (e.g. lcontains(), lscore()) which
would allow binding queries to Lucene and obtaining document/row scores.
  2) An API that would allow DML intercepts, like Oracle's ODCI.
  3) The ability to extend and/or implement new types of domain indexes
that the engine's query evaluation and execution/optimization planner can
use efficiently.

Thanks Marcelo.

-- Joaquin

On Sun, Sep 7, 2008 at 8:16 AM, J. Delgado [EMAIL PROTECTED]wrote:

 On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED]wrote:

  for example joins are not possible using SOLR).

 It's largely *because* Lucene doesn't do joins that it can be made to
 scale out. I've replaced two large-scale database systems this year with
 distributed Lucene solutions because this scale-out architecture provided
 significantly better performance. These were semi-structured systems too.
 Lucene's comparatively simplistic data model/query model is both a weakness
 and a strength in this regard.


  Hey, maybe the right way to go for a truly scalable and high performance
 semi-structured database is to marry HBase (Big-table like data storage)
 with SOLR/Lucene. I concur with you in the sense that simplistic data models
 coupled with high performance are the killer combination.

 Let me quote this from the original Bigtable paper from Google:

  Bigtable does not support a full relational data model; instead, it
 provides clients with a simple data model that supports dynamic control over
 data layout and format, and allows clients to reason about the locality
 properties of the data represented in the underlying storage. Data is
 indexed using row and column names that can be arbitrary strings. Bigtable
 also treats data as uninterpreted strings, although clients often serialize
 various forms of structured and semi-structured data into these strings.
 Clients can control the locality of their data through careful choices in
 their schemas. Finally, Bigtable schema parameters let clients dynamically
 control whether to serve data out of memory or from disk.




Re: Moving SweetSpotSimilarity out of contrib

2008-09-06 Thread J. Delgado
I cannot agree more with Otis. It's all about exposure! Without references
from main JavaDocs, some cool things in contrib just remain in obscurity.

-- Joaquin

On Sat, Sep 6, 2008 at 1:08 AM, Otis Gospodnetic [EMAIL PROTECTED]
 wrote:

 Regarding SSS (and any other contrib visibility).
 Perhaps we should get into habit of referencing contrib goodies from highly
 visible (to developers) spots (no pun intended), like Javadocs.  Concretely,
 if SSS is so good or if it is simply one possible alternative Similarity
 that's available and that we (Lucene developers) know about, why are we not
 mentioning it in Javadocs for (Default)Similarity?



 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/search/Similarity.html

 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/search/DefaultSimilarity.html

 Javadocs have a lot of visibility, esp. in modern IDEs.  We can also have
 this mentioned on the Wiki, but Wiki is documentation that I think most
 people don't really like to read.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: Michael McCandless [EMAIL PROTECTED]
  To: java-dev@lucene.apache.org
  Sent: Friday, September 5, 2008 6:41:48 AM
  Subject: Re: Moving SweetSpotSimilarity out of contrib
 
 
  Chris Hostetter wrote:
 
   : Another important driver is the out-of-the-box experience.
  
   I honestly have no idea what an OOTB experience for Lucene-Java
   means ...
   For Solr i understand, For Nutch i understand ... for a java
   library
 
  Well... even though it's a java library, Lucene still has many
  defaults.
 
  Sure, Solr has even more, so this is important for Solr too.
 
  Most non-Solr apps built on Lucene will simply use Lucene's defaults,
  for lack of knowing any better.
 
  How well such apps then work is what I'm calling the OOTB experience
  for Lucene, and I think it's well-defined and important.
 
  Especially spooky is when a publication does an eval of search
  libraries because typically they will eval only the OOTB experience and
  won't go looking on our wiki to discover all the tricks.
 
  With IndexWriter we default to flushing by RAM usage (16 MB) not by
  buffered doc count, to ConcurrentMergeScheduler, to
  LogByteSizeMergePolicy, to compound file format, mergeFactor is 10,
  etc.
 
  IndexSearcher (and also IndexWriter, for lengthNorm) uses
  Similarity.getDefault().
 
  QueryParser uses a number of defaults when translating the end user's
  search text into all sorts of Query instances.
 
  In 2.3 we made great improvements to OOTB indexing speed, and that's
  important.
 
  I think making improvements to OOTB relevance is also important, but I
  agree this is much harder to do in general since there are so many
  differences between the content in apps.
 
  That all being said... I also agree (on closer inspection) it's not
  cut and dry that SSS is a good choice for default (what would be the
  right default for its curve?).
 
  If other OOTB relevance improvements surface with time (eg a good way
  to do passage scoring/retrieval or proximity scoring or lexical
  affinity) then we should strongly consider them.  Such things always
  come with a performance cost, though, so it'll be an interesting
  discussion...
 
   But then we get into that back-compat concern issue.
 
  Well...is Lucene's precise scoring formula guaranteed not to change
  between releases?  I assume and hope not.
 
  Just like with indexing, where the precise choice of when committing
  and merging and flushing happens was never promised, that lack of
  API promise gave us the freedom to drastically improve the OOTB
  indexing speed without breaking any promises.  We need to keep that
  same freedom on the search side.
 
  From our last discussion on back compat, our most powerful weapon is
  to NOT make promises when they aren't necessary or could limit future
  back compat.
 
  And, if we have a back compat situation that's holding back Lucene's
  OOTB adoption by new users, we should think hard about switching the
  default to favor new users and making an option to quickly get back to
  the old behavior to accommodate existing users.  The recent bug fixes
  to StandardTokenizer are such examples.
 
  Mike
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Re[4]: lucene scoring

2008-08-08 Thread J. Delgado
The only scores that I can think of that can measure quality across
different queries are invariant scores such as PageRank, i.e. scoring
the document on its general information value and then using that as a filter
regardless of the query. This is very different from the problem of trying
to normalize the score for the same query over different shards (indexes) in a
federated query setting, which has been researched extensively.

The reason why two queries have different scales for scores is the
probabilistic nature of the algorithms, which view word occurrences as
independent random variables. Thus the occurrence of each word in a document
is treated as an independent event. Joint and conditional probabilities can
be estimated by looking at word co-occurrence, which could be used to compare
two specific results (i.e. how relevant is document X to both "baby kittens"
and "death metal", or if "baby kittens" is present in a doc, how likely is it
that "death metal" is present too), but using the TF-IDF based score as an
absolute measure is like trying to compare pears with apples. Trying to
normalize it is an ill-defined task.

-- J.D.



2008/8/8 Александр Аристов [EMAIL PROTECTED]

 Relevance ranking is an option but we still won't be able to compare results.
 Let's say we have distributed searching - in this case the top 10 from one
 server is not the same as the top 10 from another. Even worse, we may find
 that in the resulting set the document with the top score is worse than others.

 What if we disable normalization or make it constant? Will the results be
 absolutely meaningless?

 And another approach: can we calculate the most possible top value? Or maybe
 just an approximation of it? We would then be able to compare results with it.

 Alex


 -Original Message-
 From: Grant Ingersoll [EMAIL PROTECTED]
 To: java-dev@lucene.apache.org
 Date: Thu, 7 Aug 2008 15:54:41 -0400
 Subject: Re: Re[2]: lucene scoring


 On Aug 7, 2008, at 3:05 PM, Александр Аристов wrote:

  I want to implement searching with the ability to set a so-called
  confidence level below which I would treat documents as garbage. I
  cannot define the level per query, as the level should be relevant
  for all documents.
 
  With the current scoring implementation the level would mean nothing. I
  don't believe that since that time (the thread is from 2005)
  nothing has been done towards resolving the issue.

 That's because there is no resolution to be had, as far as I know, but
 I'm open to suggestions (patches are even better.)  What would it mean
 to say that a score of 0.5 for baby kittens is comparable to a score
 of 0.5 for death metal?  Like I said, I don't think that 0.5 for
 baby kittens is even comparable later if you added other documents
 that contain any of the query terms.

 
 
  Do you think any workarounds like implementing more sophisticated
  queries so that we have approximately the same normalization values?

 I just don't think you will be successful with this, and I don't
 believe it is a Lucene issue alone, but one that applies to all search
 engines, but I could be wrong.

 I get what you are trying to do, though, I've wanted to do it from
 time to time.   Another approach may be to look for significant
 differences between scores w/in a result set.   For example, if doc 1
 is 0.8, doc 2 is 0.79 and then doc 3 is 0.2, then maybe one could
 argue that doc 3 is garbage, but even that is somewhat of a stretch.
 Garbage truly is in the eye of the beholder.
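
 A minimal sketch of that score-gap heuristic (this is not an existing Lucene
 API; the class name and the threshold value are made up for illustration):

   import org.apache.lucene.search.ScoreDoc;

   public class ScoreGapCutoff {
     /** Number of top docs to keep before the first large relative score drop. */
     public static int cutoff(ScoreDoc[] byScoreDesc, float maxRelativeDrop) {
       for (int i = 1; i < byScoreDesc.length; i++) {
         float prev = byScoreDesc[i - 1].score;
         float curr = byScoreDesc[i].score;
         // e.g. with maxRelativeDrop = 0.5, a fall from 0.79 to 0.2 cuts the list here
         if (prev > 0 && (prev - curr) / prev > maxRelativeDrop) {
           return i;
         }
       }
       return byScoreDesc.length;  // no big gap found, keep everything
     }
   }

 Whether such a cutoff helps is, as noted above, entirely application-specific.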

 Another option is to do more relevance tuning to make sure your top 10
 are as good as possible so that your garbage is minimized.

 -Grant
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: My understanding about lucene internals.

2008-06-30 Thread J. Delgado
Prasen,

Great summary!

On Mon, Jun 30, 2008 at 4:27 AM, Mukherjee, Prasenjit
[EMAIL PROTECTED] wrote:
 Hi,
  I have tried to consolidate my understanding of lucene with the
 following ppt slides. I would really appreciate your comments
 (especially where I am incorrect), specifically on slide 16, which talks
 about the segment layout (aka file format).

 http://docs.google.com/Presentation?docid=dmsxgtg_98dbh529dn


 Thanks,
 Prasen

 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to do a query using less than or greater than

2008-06-24 Thread J. Delgado
I do not believe that the operators < and > are supported by
Lucene, but you can use RANGE SEARCH to achieve what you want. Just
put an unreachable upper boundary for greater than, or an unreachable
lower boundary for less than.
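
For example, with the Lucene 2.x query parser an exclusive range with an
unreachable bound stands in for the missing operator (a sketch; the field
name and the sentinel values are illustrative for yyyymmdd-style fields):

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;

  public class OpenEndedRangeExample {
    public static void main(String[] args) throws Exception {
      QueryParser parser = new QueryParser("date", new WhitespaceAnalyzer());
      // "later than 20070101": exclusive lower bound, unreachable upper bound
      Query laterThan = parser.parse("date:{20070101 TO 99999999}");
      // "earlier than 20070101": unreachable lower bound, exclusive upper bound
      Query earlierThan = parser.parse("date:{00000000 TO 20070101}");
      System.out.println(laterThan + "  |  " + earlierThan);
    }
  }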

J.D.
On Tue, Jun 24, 2008 at 3:31 PM, Kyle Miller [EMAIL PROTECTED] wrote:
 Hi all,
   I've been looking at the lucene documentation and the source code
 and I can't seem to find a greater than or less than operator in the
 default query syntax for lucene.  Does anyone one know if they exists
 and how to use them?  For a concrete example I'm looking to do a query
 on a date field to find documents earlier than a specified date or
 later than a specified date.  Ex: date:(> 20070101) or date:(< 20070101).
 I looked at the range query feature but it didn't appear
 to cover this case. Anyone have any suggestions?
 Thanks,
 Kyle

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Fwd: New binary distribution of Oracle-Lucene integration

2008-04-13 Thread J. Delgado
Here is the latest on the Oracle-Lucene Integration.

J.D.

-- Forwarded message --
From: Marcelo Ochoa [EMAIL PROTECTED]
Date: Mon, Apr 7, 2008 at 10:01 AM
Subject: New binary distribution of Oracle-Lucene integration
To: [EMAIL PROTECTED]


Hi all:
 I just released a new version of Oracle-Lucene integration
 implemented as a Domain Index.
 The binary distribution has a very straightforward installation and
 testing procedure; downloads are at the SF.net web site:

http://sourceforge.net/project/showfiles.php?group_id=56183package_id=255524release_id=589900
 Updated documentation is available as Google Document at:
 http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
 Source is available with public CVS access at:
 http://dbprism.cvs.sourceforge.net/dbprism/ojvm/
 As a consequence of reading many mails with feedback and development
 tips from this list, this new version has a lot of performance
 improvements: a rowid-to-Lucene-doc-id cache, and usage of the
 LoadFirstFieldSelector class to prevent Lucene from loading a complete
 doc if we only need the rowid.
 Many thanks to all for sharing the experience.
 A complete list of changes is at:

http://dbprism.cvs.sourceforge.net/dbprism/ojvm/ChangeLog.txt?revision=1.3view=markup
 Best regards, Marcelo.

 PD: I plan to make a new version of the Oracle-Lucene integration
 synchronized with Lucene 2.3.1 ASAP.
 --
 Marcelo F. Ochoa
 http://marceloochoa.blogspot.com/
 http://marcelo.ochoa.googlepages.com/home
 __
 Do you Know DBPrism? Look @ DB Prism's Web Site
 http://www.dbprism.com.ar/index.html
 More info?
 Chapter 17 of the book Programming the Oracle Database using Java &
 Web Services
 http://www.amazon.com/gp/product/183296/
 Chapter 21 of the book Professional XML Databases - Wrox Press
 http://www.amazon.com/gp/product/1861003587/
 Chapter 8 of the book Oracle & Open Source - O'Reilly
 http://www.oreilly.com/catalog/oracleopen/



--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
__
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book Programming the Oracle Database using Java &
Web Services
http://www.amazon.com/gp/product/183296/
Chapter 21 of the book Professional XML Databases - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book Oracle & Open Source - O'Reilly
http://www.oreilly.com/catalog/oracleopen/


Re: an API for synonym in Lucene-core

2008-03-13 Thread J. Delgado
Mathieu,

Have you thought about incorporating a standard format for thesauri, and
thus for query/index expansion? Here is the recommendation from NISO:
http://www.niso.org/committees/MT-info.html

Beyond synonyms, having the capabilities to specify the use of BT (broader
terms or Hypernyms) or NT (narrower terms or Hyponyms) is very useful to
provide more general or specific context to the query.

There are other tricks, such as weighting terms from a thesaurus based on the
number of occurrences in the index, as well as extracting potential
used-for terms by looking at patterns such as a word followed by a
parenthesis with a small number of tokens (i.e. term (alternate term)).
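
As a rough sketch of that kind of weighted expansion (nothing like this is in
Lucene core; the field, terms and boosts are invented here), the related terms
could simply be added as down-boosted SHOULD clauses:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  public class ThesaurusExpansionSketch {
    /** Expand one term with (related term, weight) pairs, e.g. BT/NT entries. */
    public static BooleanQuery expand(String field, String term,
                                      String[] related, float[] weights) {
      BooleanQuery q = new BooleanQuery();
      q.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
      for (int i = 0; i < related.length; i++) {
        TermQuery rq = new TermQuery(new Term(field, related[i]));
        rq.setBoost(weights[i]);  // e.g. lower weight for broader terms (BT)
        q.add(rq, BooleanClause.Occur.SHOULD);
      }
      return q;
    }
  }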

J.D.


On Thu, Mar 13, 2008 at 2:52 AM, Mathieu Lecarme [EMAIL PROTECTED]
wrote:

 I'll slice my contrib into small parts.

 Synonyms
 1) Synonym (Token + a weight)
 2) Synonym provider from the OO.o thesaurus
 3) SynonymTokenFilter
 4) Query expander which applies a filter (and a boost) on each of its
 TermQuery instances
 5) a Synonym filter for the query expander
 6) to be efficient, a Synonym can be excluded if it doesn't exist in the Index.
 7) Stemming can be used as a dynamic Synonym

 Spell checking, or the did you mean? pattern
 1) The main concept is in the SpellCheck contrib, but in a
 non-extensible implementation
 2) In some languages, like French, homophony is very important in
 misspelling; there is more than one way to write a word
 3) Homophony rules are provided by Aspell in a neutral language (just
 like Snowball for stemming); I implemented a translator to build Java
 classes from aspell files (it's the same format across the aspell family:
 myspell and hunspell, which are used by OO.o and Firefox)
 https://issues.apache.org/jira/browse/LUCENE-956

 Storing information about words found in an index
 1) It's the Dictionary used in the SpellCheck contrib, in a more open way:
 a lexicon. It's a plain old Lucene index; a word becomes a Document, and
 Fields store computed information like size, ngram tokens and homophony.
 All of them use filters taken from TokenFilter, so code duplication is avoided.
 2) This information may be out of sync with the index, in order not to
 slow down the indexing process, so some information needs to be checked
 later (does this synonym already exist in the index?), and lexicon
 correction can be done on the fly (if the synonym doesn't exist, write
 it to the lexicon for next time). There is some work here to find
 the best and fastest way to keep information synchronized between index
 and lexicon (hard links, a log for nightly replay, complete iteration over
 the index to find deleted and new entries ...)
 3) Similar (more than only Synonym) and Near (misspelled) words use the
 Lexicon.
 https://issues.apache.org/jira/browse/LUCENE-1190

 Extending it
 1) The Lexicon can be used to store Nouns, i.e. words that work better
 together, like New York, Apple II or Alexander the Great.
 Extracting nouns from a thesaurus is very hard, but Wikipedia people have
 done part of the work; article titles can be a good start for building a
 noun list. And it works in many languages.
 Nouns can be used as an intuitive PhraseQuery, or as a suggestion for
 refining results.

 Implementing it well in Lucene
 The SpellCheck and WordNet contribs do part of it, but in a specific and
 non-extensible way. I think it's better when the foundation is checked by
 a Lucene maintainer, and afterwards a contrib is built on top of this foundation.

 M.


 Otis Gospodnetic a écrit :
  Grant, I think Mathieu is hinting at his JIRA contribution (I looked at
 it briefly the other day, but haven't had the chance to really understand
 it).
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
  - Original Message 
  From: Mathieu Lecarme [EMAIL PROTECTED]
  To: java-dev@lucene.apache.org
  Sent: Wednesday, March 12, 2008 5:47:40 AM
  Subject: an API for synonym in Lucene-core
 
  Why doesn't Lucene have a clean synonym API?
  The WordNet contrib is not an answer; it provides an interface for its own
  needs, and most of the world doesn't speak English.
  Compass provides a tool, just like Solr. Lucene is the framework for
  applications like Solr, Nutch or Compass, so why not backport low-level
  features from these projects?
  A synonym API should provide a TokenFilter, an abstract storage that
  maps a token to similar tokens with weights, and tools for expanding
  queries.
  The OpenOffice dictionary project can provide data in different
  languages, with compatible licenses, I presume.
 
  M.
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL 

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I assume that Google also has a distributed index over their
GFS/MapReduce implementation. Any idea how they achieve this?

J.D.



On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote:

 There seem to be a few other players in this space too.

 Are you from Rackspace?
 (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-
 query-terabytes-data)

 AOL also has a Hadoop/Solr project going on.

 CNET does not have much brewing there.  Although Yonik and I had
 talked about it a bunch -- but that was long ago.

 --cw

 Clay Webster   tel:1.908.541.3724
 Associate VP, Platform Infrastructure http://www.cnet.com
 CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]


  -Original Message-
  From: Ning Li [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 06, 2008 1:57 PM
  To: [EMAIL PROTECTED]; java-dev@lucene.apache.org; solr-
  [EMAIL PROTECTED]
  Subject: Lucene-based Distributed Index Leveraging Hadoop
 
  There have been several proposals for a Lucene-based distributed index
  architecture.
   1) Doug Cutting's Index Server Project Proposal at
 
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
   2) Solr's Distributed Search at
  http://wiki.apache.org/solr/DistributedSearch
   3) Mark Butler's Distributed Lucene at
  http://wiki.apache.org/hadoop/DistributedLucene
 
  We have also been working on a Lucene-based distributed index
  architecture.
  Our design differs from the above proposals in the way it leverages
  Hadoop
  as much as possible. In particular, HDFS is used to reliably store
  Lucene
  instances, Map/Reduce is used to analyze documents and update Lucene
  instances
  in parallel, and Hadoop's IPC framework is used. Our design is geared
  for
  applications that require a highly scalable index and where batch
  updates
  to each Lucene instance are acceptable (versus finer-grained
  document-at-a-time updates).
 
  We have a working implementation of our design and are in the process
  of evaluating its performance. An overview of our design is provided
  below.
  We welcome feedback and would like to know if you are interested in
  working
  on it. If so, we would be happy to make the code publicly available.
 At
  the
  same time, we would like to collaborate with people working on
 existing
  proposals and see if we can consolidate our efforts.
 
  TERMINOLOGY
  A distributed index is partitioned into shards. Each shard
  corresponds
  to
  a Lucene instance and contains a disjoint subset of the documents in
  the
  index.
  Each shard is stored in HDFS and served by one or more shard
 servers.
  Here
  we only talk about a single distributed index, but in practice
 multiple
  indexes
  can be supported.
 
  A master keeps track of the shard servers and the shards being
 served
  by
  them. An application updates and queries the global index through an
  index client. An index client communicates with the shard servers to
  execute a query.
 
  KEY RPC METHODS
  This section lists the key RPC methods in our design. To simplify the
  discussion, some of their parameters have been omitted.
 
On the Shard Servers
  // Execute a query on this shard server's Lucene instance.
  // This method is called by an index client.
  SearchResults search(Query query);
 
On the Master
  // Tell the master to update the shards, i.e., Lucene instances.
  // This method is called by an index client.
  boolean updateShards(Configuration conf);
 
  // Ask the master where the shards are located.
  // This method is called by an index client.
  LocatedShards getShardLocations();
 
  // Send a heartbeat to the master. This method is called by a
  // shard server. In the response, the master informs the
  // shard server when to switch to a newer version of the index.
  ShardServerCommand sendHeartbeat();
 
  QUERYING THE INDEX
  To query the index, an application sends a search request to an index
  client.
  The index client then calls the shard server search() method for each
  shard
  of the index, merges the results and returns them to the application.
  The
  index client caches the mapping between shards and shard servers by
  periodically calling the master's getShardLocations() method.
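
  As an illustration of the merge step the index client performs (a sketch
  using plain Lucene ScoreDoc arrays rather than the SearchResults type above,
  which is not shown here; per-shard idf makes raw scores only approximately
  comparable across shards):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.lucene.search.ScoreDoc;

    public class ShardResultMerger {
      /** Merge per-shard top hits into one global top-n list, ordered by score. */
      public static List<ScoreDoc> merge(ScoreDoc[][] perShardHits, int n) {
        List<ScoreDoc> all = new ArrayList<ScoreDoc>();
        for (int s = 0; s < perShardHits.length; s++) {
          for (int i = 0; i < perShardHits[s].length; i++) {
            all.add(perShardHits[s][i]);
          }
        }
        // Sort all collected hits by descending score before taking the top n.
        Collections.sort(all, new Comparator<ScoreDoc>() {
          public int compare(ScoreDoc a, ScoreDoc b) {
            return Float.compare(b.score, a.score);
          }
        });
        return all.size() > n ? all.subList(0, n) : all;
      }
    }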
 
  UPDATING THE INDEX USING MAP/REDUCE
  To update the index, an application sends an update request to an
 index
  client.
  The index client then calls the master's updateShards() method, which
  schedules
  a Map/Reduce job to update the index. The Map/Reduce job updates the
  shards
  in
  parallel and copies the new index files of each shard (i.e., Lucene
  instance)
  to HDFS.
 
  The updateShards() method includes a configuration, which provides
  information for updating the shards. More specifically, the
  configuration
  includes the following information:
- Input path. This provides the location of updated documents, e.g.,
  HDFS

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I'm pretty sure that what you describe is the case, especially taking into
consideration that PageRank (what drives their search results) is a per-document
value that is probably recomputed after some long time interval. I
did see a MapReduce algorithm to compute PageRank as well. However, I do
think they must be distributing the query load across many, many machines.

I also think that limiting flat results to the top 10 and then doing paging is
optimized for performance. Yet another reason why Google has not implemented
facet browsing or real-time clustering of their result set.

J.D.

On Feb 6, 2008 4:22 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 (trimming excessive cc-s)

 Ning Li wrote:
  No. I'm curious too. :)
 
  On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:
 
  I assume that Google also has distributed index over their
  GFS/MapReduce implementation. Any idea how they achieve this?

 I'm pretty sure that MapReduce/GFS/BigTable is used only for creating
 the index (as well as crawling, data mining, web graph analysis, static
 scoring etc). The overhead of MR jobs is just too high.

 Their impressive search response times are most likely the result of
 extensive caching of pre-computed partial hit lists for frequent terms
 and phrases - at least that's what I suspect after reading this paper
 (not by Google folks, but very enlightening):
 http://citeseer.ist.psu.edu/724464.html

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com


 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Oracle-Lucene Domain Index (New Release)

2007-12-13 Thread J. Delgado
Once again, LendingClub.com, a social lending network that today
announced nation-wide expansion (see TechCrunch), is pleased to
contribute to the open source community a new release (2.2.0.2.0) of
the Oracle-Lucene Domain Index, a fast implementation of text indexing
and search using Lucene within the Oracle relational database. Many
thanks to Marcelo Ochoa, the developer that made it all happen!

Among the goodies you will find in the new release are:

* LuceneDomainIndex.countHits() function to replace the select count from
.. where lcontains(..) > 0 syntax.
* support for inline pagination via lcontains(col, 'rownum:[n TO m] AND ...')
(see the JDBC sketch below)
* rounding and padding support for columns of type date, timestamp, number,
float, varchar2 and char
* ODCI API array DML support
* BLOB parameter support
* sort by a column passed via the
lcontains(col, query_parser_str, sort_str, corr_id) syntax
* Logging support using Java Util Logging package
* JUnit test suites emulating middle tier environment
* Support for rebuild and optimize online for SyncMode:OnLine index
* XMLDB Export which allows inspecting the Lucene index using Luke or
other tools
* AutoTuneMemory parameter for replacing MaxBufferedDocs parameter
* Functional column support
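
As an illustration only, the new operators could be exercised from Java over
plain JDBC along these lines (a sketch: the connection string, table and
column names are invented, and the query string follows the lcontains syntax
listed above):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class LuceneDomainIndexSketch {
    public static void main(String[] args) throws Exception {
      Connection con = DriverManager.getConnection(
          "jdbc:oracle:thin:@//localhost:1521/orcl", "scott", "tiger");
      Statement st = con.createStatement();
      // Inline pagination via rownum:[n TO m] inside the Lucene query string,
      // as described in the release notes; "docs" and "body" are made-up names.
      ResultSet rs = st.executeQuery(
          "SELECT id FROM docs " +
          "WHERE lcontains(body, 'rownum:[1 TO 10] AND lucene') > 0");
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
      rs.close(); st.close(); con.close();
    }
  }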

Here are the pointers

Full Documentation:
http://docs.google.com/Doc?docid=ddgw7sjp_54fgj9kghl=en

New Binaries
http://sourceforge.net/project/showfiles.php?group_id=56183package_id=255524

Release Notes:
http://sourceforge.net/project/shownotes.php?release_id=561159group_id=56183

Cheers!

Joaquin Delgado, PhD
CTO, Lending Club

About Lending Club (TM)
LendingClub.com is an online social lending network where people can
borrow and lend money among themselves based upon their affinities
and/or social connections. Across
all 50 states, members can borrow money at a better interest rate than
they would get from a bank or credit card and invest in a diversified
portfolio of loans with higher rates of
return than those served by savings accounts, CDs or other online
lending services.
LendingMatch (TM) technology helps match lenders and borrowers by
using connections established through social networks, associations
and online communities,
and build diversified portfolios based on lender preferences. Lending
Club is headquartered in Sunnyvale, CA. More information is available
at www.lendingclub.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Analyzers

2007-10-28 Thread J. Delgado
If you don't want to start from scratch you may look at what is available in
the GATE framework, also written in Java:
http://gate.ac.uk/gate/doc/plugins.html#hindi

2007/10/28, Grant Ingersoll [EMAIL PROTECTED]:

 A Google search reveals:
 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200408.mbox/[EMAIL 
 PROTECTED]

 Which leads to
 http://ltrc.iiit.net/showfile.php?filename=onlineServices/morph/index.htm

 However, I don't see one contributed to contrib/analyzers, so feel
 free to take it on.  Sounds like a welcome addition to me.

 You might also try asking others on the Lucene User mailing list
 concerning their experience.

 Cheers,
 Grant

 On Oct 28, 2007, at 8:49 PM, Sandeep Mahendru wrote:

  Hi All,
 
My name is Sandeep Mehandru.
 
  I have been working at Wachovia Bank, charlotte North Carolina.
 
  I have been involved in a project,where I am designing a Report/Log
  tracker,
  which support English like queries.
  I have been using Lucene indexing/searching a lot.
 
  I have gone through the concepts of Analyzers, Filters and Tokens. I
  have also done some lexical analysis in the past on some projects.
 
  I am very interested in writing a Lucene analyzer for the HINDI
  language.
  Has this work been done? If not I would like to work on it and add
  it to the
  Lucene API.

  I know that first I would have to work on defining the grammar for
  the Hindi
  language.
 
  Please let me know your comments on the same.
 
  Regards,

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Boot Camp Training:
 ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ



 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Oracle-Lucene integration (OJVMDirectory and Lucene Domain Index) - LONG

2007-09-13 Thread J. Delgado
I'm very happy to announce the partial rework and extension to LUCENE-724
(Oracle-Lucene Integration), primarily based on new requirements from
LendingClub.com, who commissioned the work to Marcelo Ochoa, the contributor
of the original patch (great job Marcelo!). As a contribution of
LendingClub.com to the Lucene community we have posted the code on a public
CVS (SourceForge) as explained below.

Here at Lending Club (www.lendingclub.com) we have very specific needs
regarding the indexing of both structured and unstructured data, most of it
transactional in nature and sitting in our Oracle 10gR2 DB, with a highly
complex schema. Our ranking of loans in the inventory includes components
of exact, textual and hardcore mathematical calculations including time,
amount and spatial constraints. This integration of Lucene into Oracle as a
Domain Index will now allow us to query this inventory in real-time. Going
against the Lucene index, created on synthetic documents comprised of
fields being populated from diverse tables (user data store), eliminates the
need to create very complex joins to link data from different tables at
query time. This, along with the support of the full Lucene query language,
makes this a great alternative to:

   1. Using Lucene outside the database, which requires crawling the
   data and storing the index outside the database, losing all the benefits of
   a fully transactional system and a secure environment.
   2. Using Oracle Text, which is very powerful but lacks the
   extensibility and flexibility that Lucene offers (for example, being able to
   query the index directly from the Java layer or implementing our own ranking
   algorithm), though to be completely fair some of this is addressed in the new
   Oracle DB 11g version.

If anyone is interested in learning more how we are going to use this within
Lending Club, please drop me a line. BTW, please make sure you check us out:
Lending Club (http://www.lendingclub.com/), the rapidly growing
people-to-people (P2P) lending service that launched as a Facebook
application in May 2007, today announced the public availability of its
services with the launch of LendingClub.com. Lending Club connects lenders
and borrowers based upon shared affinities, enabling them to bypass banks to
secure better interest rates on loans... more about the announcement here
http://www.sys-con.com/read/428678.htm. We have seen many entrepreneurs
applying for loans and being helped by regular people to build their
business with the money obtained at very low interest.

OK, without further marketing stuff (sorry for that), here is the original
note sent to me by Marcelo that summarizes all the new cool functionalities:

OJVMDirectory, a Lucene Integration running inside the Oracle JVM is going
one step further.

This new release includes:

   - Synchronized with latest Lucene 2.2.0 production
   - Replaced the in-memory (Vector-based) storage with direct BLOB IO,
   reducing memory usage for large indexes.
   - Support for user data stores; you are no longer limited to indexing one
   column at a time (a Data Cartridge API limitation on 10g), and can now index
   multiple columns of the base table plus columns of related tables joined
   together.
   - User Data Stores can be customized by the user: by writing a
   simple Java class, users can control which columns are indexed, the padding
   used, or any other behavior prior to the document-adding step.
   - There is a DefaultUserDataStore which gets all columns of the query
   and builds a Lucene Document with Fields representing each database
   column; these fields are automatically padded if they hold NUMBER data or
   rounded if they hold DATE data, for example.
   - The lcontains() SQL operator supports the full Lucene QueryParser syntax to
   provide access to all indexed columns; see examples below.
   - Support for the DOMAIN_INDEX_SORT and FIRST_ROWS hints; if
   you want rows ordered by the lscore() operator (ascending or descending), the
   optimizer hint will assume that the Lucene Domain Index returns rowids in the
   proper order, avoiding an inline view to sort them.
   - Automatic index synchronization using AQ's callback.
   - Lucene Domain Index creates extra tables named IndexName$T and an
   Oracle AQ named IndexName$Q with its storage table IndexName$QT in the user's
   schema, so you can alter storage preferences if you want.
   - ojvm project is at SourceForge.net CVS, so anybody can get it and
   collaborate ;)
   - Tested against 10gR2 and 11g database.


Some sample usages:

create table t2 (
 f4 number primary key,
 f5 VARCHAR2(200));
create table t1 (
 f1 number,
 f2 CLOB,
 f3 number,
 CONSTRAINT t1_t2_fk FOREIGN KEY (f3)
 REFERENCES t2(f4) ON DELETE cascade);
create index it1 on t1(f3) indextype is lucene.LuceneIndex
 parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;ExtraCols:f2');

alter index it1


Re: Progressive Query Relaxation

2007-05-11 Thread J. Delgado

Hoss,

I never got to acknowledge your analysis. Well done. I do want to hear your
opinion about the following posting I sent to the list, which aims at
looking at the analogy between search engines and relational/XML databases
as they progress to evolve into a single type of retrieval system:

The ever-growing presence of mingled structured and unstructured data is a
fact of life in the modern systems we have to deal with. Clearly, the tendency
is that full-text indexing is moving towards DB functionality, i.e.
(attribute, value) fields for projection/filtering, sorting, faceted queries,
transactional CRUD operations, etc. Though set manipulation is not Lucene's
or Solr's forte, the document-object model maps very well to rows of
relational sets or tables, even more so since CLOBs and TEXT fields were
introduced.

On the other hand, relational databases with XML and OO extensions and
native XML repositories still have to deal with the problem of RANKING
unstructured text and combinations of text fragments and structured
conditions, thus no longer dealing just with a set/relational model that
yields binary answers but extending their query languages to handle the
concepts of fuzziness, relevance, etc. (e.g. SQL/MM, XQuery Full-Text).

I would like once again to open this can of worms, and perhaps think out of
the box, without classifying DB and Full-Text as simply different, as we
analyze concepts to further understand the real path for the evolution of
Lucene/Solr.

Here is a very interesting attempt to create a special type of index
called Domain Index to query unstructured data within Oracle by Marcelo
Ochoa:
https://issues.apache.org/jira/browse/LUCENE-724

Other interesting articles:

XQuery 1.0 - Full-Text:
http://www.w3.org/TR/xquery-full-text/
SQL/MM Full-Text
http://www.wiscorp.com/2CD1R1-02-fulltext-2001-12.pdf

Discussions on *XML data model vs. relational model*
http://www.xml.com/cs/user/view/cs_msg/2645

http://www.w3.org/TR/xpath-datamodel/
http://en.wikipedia.org/wiki/Relational_model


-- J.D.
2007/4/10, Chris Hostetter [EMAIL PROTECTED]:



: Agreed, but best match is not ONLY about keywords. Here is where the
: system developer can provide extra intelligence by doing query
: re-writing.

I finally got a chance to read through the URL (disclaimer: i do not have
a basic working knowledge of Oracle Text, such as the operators used in
query expressions.)

At its core what is being described here can easily be done with a custom
request handler that takes in a multivalue q param, and executes them in
order until it finds some matches ... careful math when dealing with start/rows
and the number of results from each query makes it easy to ensure that you
can seamlessly return results from any/all queries in the order described
(although you'd have to do something funky with the raw score values if
you actually wanted to return them to the client)

In general though, I agree with Walter ... this seems like a very naive
approach.  At a very low conceptually level, The DisMaxRequestHandler does
what the early counter example in the link talks about...

  select book_id from books
  where contains (author, '(michel crichton) OR (?michel ?crichton)
  OR (michel OR crichton) OR (?michel OR ?crichton)') > 0

the problem is that the two criticisms of this approach (which may be valid
in Oracle text matching) don't really apply in Solr/Lucene...

   1.  From the user's point of view, hits which are a poor match will
be
 mixed in with hits which are a good match. The user wants to see good
 matches displayed first.

poor hits won't score as high as good hits -- boost
values can be assigned for the various pieces of the DisMax query so that
exact phrase matches can be weighted better than individual word matches,
coordFactors will ensure that docs only matching a few words don't score
as well as docs matching all of the words, etc...

   2. From the system's point of view, the search is inefficient. Even
if
 there were plenty of hits for exactly Michel Crichton, it would still
 have to do all the work of the fuzzy expansions and fetch data for all
the
 rows which satisfy the query.

My problem with this claim is the assumption that once you find lots of
hits for Michel Crichton you don't need to keep looking for Michel or
Crichton ... by this logic, many docs that contain the exact phrase
Michel Crichton (and are roughly the same length) will get the same
score, and the query will stop there ... the benefit of looking for
8everything* as a single query, is that the scores can become more fine
grained -- docs with 1 exact match that *also* contain things like Mr
Crichton several dozen times will score higher then docs with just that
one exact match (cosider an article about Michel Crichton in which his
full name appears only once vs an article listing popular authors, in
which Michel Crichton appears exactly once)

: Why do you say this? The rank is still provided by the search engine
: BASED ON THE QUERY submitted and 

Re: Various Ideas from ApacheCon

2007-05-10 Thread J. Delgado

The ever-growing presence of mingled structured and unstructured data is a
fact of life in the modern systems we have to deal with. Clearly, the tendency
is that full-text indexing is moving towards DB functionality, i.e.
(attribute, value) fields for projection/filtering, sorting, faceted queries,
transactional CRUD operations, etc. Though set manipulation is not Lucene's
or Solr's forte, the document-object model maps very well to rows of
relational sets or tables, even more so since CLOBs and TEXT fields were
introduced.

On the other hand, relational databases with XML and OO extensions and
native XML repositories still have to deal with the problem of RANKING
unstructured text and combinations of text fragments and structured
conditions, thus no longer dealing just with a set/relational model that
yields binary answers but extending their query languages to handle the
concepts of fuzziness, relevance, etc. (e.g. SQL/MM, XQuery Full-Text).

I would like once again to open this can of worms, and perhaps think out of
the box, without classifying DB and Full-Text as simply different, as we
analyze concepts to further understand the real path for the evolution of
Lucene/Solr.

Here is a very interesting attempt to create a special type of index
called Domain Index to query unstructured data within Oracle by Marcelo
Ochoa:
https://issues.apache.org/jira/browse/LUCENE-724

Other interesting articles:

XQuery 1.0 - Full-Text:
http://www.w3.org/TR/xquery-full-text/
SQL/MM Full-Text
http://www.wiscorp.com/2CD1R1-02-fulltext-2001-12.pdf

Discussions on *XML data model vs. relational model*
http://www.xml.com/cs/user/view/cs_msg/2645

http://www.w3.org/TR/xpath-datamodel/
http://en.wikipedia.org/wiki/Relational_model

2007/5/9, James liu [EMAIL PROTECTED]:


I think the top things lucene/solr should do:
1: easier use and less code
2: distributed index and search
3: manage these index and search servers
4: test methods or tools

i don't agree

2007/5/8, Grant Ingersoll [EMAIL PROTECTED]: Yep, my advice always is to
use a db for what a db is designed for (set manipulation) and use Lucene
for what it is good for

maybe fs+lucene/solr is better


--
regards
jl



Re: Progressive Query Relaxation

2007-04-10 Thread J. Delgado

It looks like only a handful of people actually looked at the link
provided. Indeed, it is hard to let the engine come up with a series
of queries that range from the most restrictive to the least
restrictive and still provides the best relevance. The problem is even
worse if what is relevant for one use case is irrelevant for others,
especially if you are dealing with mixed queries that combine text
with structured fields (i.e. range queries and sorting in the Lucene
case).

Let's suppose that you are dealing with the e-commerce/catalog search
case, where you would like to show hits that match the query submitted
by the end user first as an exact phrase, then with the terms near each
other, then AND'd, then fuzzy or stemmed, and finally OR'd on certain
fields (i.e. title, description, etc.), and additionally you always want
to prefer hits that fall within the specified price range and, if there
are not enough, relax to the second best (i.e. just above the price
range) until you get a max of 500 hits. All this defines exactly
in what order you want to show/rank results! How many queries do you
have to build and send to Solr/Lucene to achieve this?

Progressive relaxation, at least as Oracle has defined it, is a
flexible, developer-defined series of queries that are efficiently
executed in progression, in one trip to the engine, until the minimum
number of hits required is satisfied. It is not a self-adapting precision
scheme, nor does it try to guess what the best match is. This approach is
however very powerful (as powerful as the queries that are submitted)
and leaves the developer with the choice of controlling what queries
to execute and in which order. I don't think DisMax does this.
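
For what it's worth, a client-side approximation of that loop in Lucene terms
might look like this (a sketch only; the relaxation steps and the 500-hit
minimum come from the example above, and duplicate hits across steps are not
handled):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;

  public class ProgressiveRelaxationSketch {
    /** Run queries from most to least restrictive until minHits docs are collected. */
    public static List<ScoreDoc> search(IndexSearcher searcher, Query[] relaxationSteps,
                                        int minHits) throws IOException {
      List<ScoreDoc> hits = new ArrayList<ScoreDoc>();
      for (int i = 0; i < relaxationSteps.length && hits.size() < minHits; i++) {
        // Each step only asks for as many hits as are still missing.
        TopDocs top = searcher.search(relaxationSteps[i], null, minHits - hits.size());
        for (int j = 0; j < top.scoreDocs.length; j++) {
          hits.add(top.scoreDocs[j]);
        }
      }
      return hits;
    }
  }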

-- Joaquin


2007/4/10, Walter Underwood [EMAIL PROTECTED]:

From the name, I thought this was an adaptive precision scheme,
where the engine automatically tries broader matching if there
are no matches or just a few. We talked about doing that with
Ultraseek, but it is pretty tricky. Deciding when to adjust it is
harder than making it variable.

Instead, this is an old idea that search amateurs seem to like.
Show all exact matches, then near matches, etc. This is the
kind of thing people suggest when they don't understand that
a ranking algorithm combines that evidence in a much more
powerful way. I talked customers out of this once or twice
each year at Ultraseek.

This approach fails for:

* common words
* misspellings

Since both of those happen a lot, this idea fails for a lot
of queries.

I presume that Oracle implemented this to shut up some big customer,
since it isn't a useful feature unless it closes a sale.

DisMax gives you something somewhat similar to this, by
selecting the best matching field. That is much more powerful
and gives much better results.

wunder

On 4/9/07 12:46 AM, J. Delgado [EMAIL PROTECTED] wrote:

 Has anyone within the Lucene or Solr community attempted to code a
 progressive query relaxation technique similar to the one described
 here for Oracle Text?
 http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

 Thanks,

 -- J.D.




Re: Progressive Query Relaxation

2007-04-10 Thread J. Delgado

See my comments below.

2007/4/10, Walter Underwood [EMAIL PROTECTED]:

On 4/10/07 10:06 AM, J. Delgado [EMAIL PROTECTED] wrote:

 Progressive relaxation, at least as Oracle has defined it, is a
 flexible, developer defined series of queries that are efficiently
 executed in progression and in one trip to the engine, until minimum
 of hits required is satisfied. It is not a self-adapting precision
 scheme, nor does it try to guess what the best match is.

Correct. Search engines are all about the best match. Why would
you show anything else?


Agreed, but best match is not ONLY about keywords. Here is where the
system developer can provide extra intelligence by doing query
re-writing.



This is an RDBMS flavored approach, not an approach that considers
natural language text.


Why do you say this? The rank is still provided by the search engine
BASED ON THE QUERY submitted and it does consider natural language
text. It's just leaving the order of execution in the hands of the
developer who knows better what the system should return for some
specific cases.


Sets of matches, not a ranked list. It fails
as soon as one of the sets gets too big, like when someone searches
for laserjet at HP.com. That happens a lot.


Nope... we are talking about the same thing: a ranked list, and all the
other cool stuff regarding automatic query expansion, hit list
clustering/faceted search, etc. has solved the laserjet problem you
mentioned above.



It assumes that all keywords are the same, something that Gerry
Salton figured out was false thirty years ago. That is why we
use tf.idf instead of sets of matches.


I'm totally with you. Oracle Text uses TF.IDF as well :-)
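
For readers following along, one common variant of the tf.idf weight under
discussion, sketched in Java; the exact damping and smoothing differ between
engines, so treat the constants as illustrative:

// Rarer terms (low document frequency) get a higher weight, so not all
// keywords contribute equally to the score.
static double tfIdf(int termFreqInDoc, int docFreq, int numDocs) {
    double tf  = Math.sqrt(termFreqInDoc);                          // dampened term frequency
    double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));  // inverse document frequency
    return tf * idf;
}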



I see a lot of design without any talk about what problem they are
solving. What queries don't work? How do we make those better?
Let's work from real logs and real data. Oracle's hack doesn't
solve any problem I've seen in real query logs.



I think you have something personal against Oracle... Hey, I have no
interest in defending Oracle, but this is no hack. It has its place for
certain applications. I'm not in favor of using Oracle Text; all I
asked was whether this feature was available in Solr/Lucene, because I
think it would be useful.


I'm doing e-commerce search, and our current engine does pretty
much what Oracle is offering. The results are not good, and we
are replacing it with Solr and DisMax. My off-line relevance testing
shows a big improvement.


Yep. One thing we agree on (that Netflix's engine's results are not
good). In any case, I think moving to Solr and DisMax is a great idea
and should improve relevance. I also think that in some cases having
control of the queries that are expanded and executing them
progressively is the right way to go. For example, Nutch implements a
pretty sophisticated query rewrite in hopes of improving the relevance
ranking for its users. I think the results can be computed more
efficiently if the whole query does not need to be evaluated, but just
enough of it to return the required number of results.

Joaquin Delgado, PhD



wunder
--
Search Guru, Netflix





Progressive Query Relaxation

2007-04-09 Thread J. Delgado

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.




Re: Progressive Query Relaxation

2007-04-09 Thread J. Delgado

The idea is to efficiently get the desired result set (top N) at once
without having to re-run different queries inside the application
logic. Query relaxation avoids having several round trips and possibly
could be offered with and without deduplication. Maybe this is a
feature required for Solr rather than for Lucene.

Question: Even if Lucene's score is not absolute, does it somewhat
determine a partial order among results of different queries?

J.D.

2007/4/9, Otis Gospodnetic [EMAIL PROTECTED]:

Not that I know of.  One typically puts that in application logic and re-runs or offers 
to run alternative queries.  No de-duping there, unless you do it in your app.  I think 
one problem with the described approach and Lucene would be that Lucene's scores are not 
absolute.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: J. Delgado [EMAIL PROTECTED]
To: java-dev@lucene.apache.org; solr-dev@lucene.apache.org
Sent: Monday, April 9, 2007 3:46:40 AM
Subject: Progressive Query Relaxation

Has anyone within the Lucene or Solr community attempted to code a
progressive query relaxation technique similar to the one described
here for Oracle Text?
http://www.oracle.com/technology/products/text/htdocs/prog_relax.html

Thanks,

-- J.D.










Re: Indexing the Interesting Part Only...

2007-03-09 Thread J. Delgado

You have to build a special HTML Junk parser.
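
For instance, a minimal sketch of such a parser using jsoup as one possible
HTML library and current Lucene field classes; the CSS selectors are
placeholders, since every news site needs its own extraction rules (or a
generic boilerplate-removal heuristic):

import org.jsoup.Jsoup;

import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class ArticleExtractor {
    // Keep only the parts worth indexing; navigation, ads and footers are dropped.
    public static org.apache.lucene.document.Document toLuceneDoc(String html) {
        org.jsoup.nodes.Document page = Jsoup.parse(html);

        String title  = page.title();
        String author = page.select("meta[name=author]").attr("content");    // placeholder rule
        String body   = page.select("article p, div.article-body p").text(); // placeholder rule

        org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new StringField("author", author, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        return doc;
    }
}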

2007/3/9, d e [EMAIL PROTECTED]:


If I'm indexing a news article, I want to avoid getting the junk (other than
the title, author and article body) into the index. I want to avoid getting
the advertisements, etc. How do I do that sort of thing?

What parts of what manual should I be reading so I will know how to do this
sort of thing?



Re: Reviving Nutch 0.7

2007-01-23 Thread J. Delgado

Nutch Newbie wrote:

Again not really proposing a new project but more easy to use
re-usable code. IMHO, Nutch will be an umbrella project for
ala-Google and Solr will be for ala-Enterpise  where Lucene
is the index lib, Hadoop is the Mapred/DFS lib ..what is missing is
Common Crawler lib, Common
indexing lib etc..


EXACTLY!

-- Joaquin


Re: Lucene Scalability Question

2007-01-10 Thread J. Delgado

This is a more general question:

Given the fact that most applications require querying a combination
of full-text and structured data, has anyone looked into building data
structures at the most fundamental level (e.g. a combination of b-trees
and inverted lists) that would enable scalable and performant
structured (e.g. SQL or XQuery) + full-text queries?

Can Lucene be taken as a basis for this, or do you recommend exploring
other routes?

-- Joaquin

2007/1/10, Chris Hostetter [EMAIL PROTECTED]:


: So you mean lucene can't do better than this ?

robert's point is that based on what you've told us, there is no reason to
think Lucene makes sense for you -- if *all* you are doing is finding
documents based on numeric ranges, then a relational database is better
suited to your task.  if you actually care about the textual IR features
of Lucene, then there are probably ways to make your searches faster, but
you aren't giving us enough information.

you said the example code you gave was in a loop ... but a loop over what?
.. what changes with each iteration of the loop? ... if there are
RangeFilters that get reused more than once, CachingWrapperFilter can come
in handy to ensure that work isn't done more often than it needs to be.

it's also not clear whether your query on type:0 is just a placeholder,
or indicative of what you actually want to do in the long run ... if all
of your queries are this simple, and all you care about is getting a count
of things that have type:0 and are in your numeric ranges, then don't use
the search method at all, just put type:0 in your ChainedFilter and
call the bits method directly.

you also haven't given us any information about whether or not you are
opening a new IndexSearcher/IndexReader every time you execute a query, or
reusing the same instance -- reuse makes the performance much better
because it can reuse underlying resources.
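
Putting those last two suggestions together, a rough sketch under the Lucene
2.x-era API used in this thread (RangeFilter, the contrib ChainedFilter,
CachingWrapperFilter, Hits); field names come from the quoted code, and the
index path and loop values are placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.misc.ChainedFilter;
import org.apache.lucene.search.*;

public class CachedFilterSearch {
    public static void run(String indexPath, String[] dataValues) throws Exception {
        IndexSearcher searcher = new IndexSearcher(indexPath);  // open once, reuse for every query

        // The filter that never changes is wrapped so its bitset is computed only once.
        Filter precisionFilter = new CachingWrapperFilter(
            new RangeFilter("precision", "+0001", "+0002", true, true));

        Query typeQuery = new TermQuery(new Term("type", "0"));

        for (String value : dataValues) {
            Filter dataFilter = new RangeFilter("data", value, value, true, true);
            Filter combined = new ChainedFilter(
                new Filter[]{precisionFilter, dataFilter}, ChainedFilter.AND);
            Hits hits = searcher.search(typeQuery, combined);   // shared searcher, cached filter
            // ... consume hits.length() / hits.doc(i) here ...
        }
        searcher.close();
    }
}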

In short: if you state some performance numbers from timing some code, and
want to know how to make that code faster, you have to actually show people
*all* of the code for them to be able to help you.


:   I still have the search problem I had before, now search takes around
:  750
:  msecs for a small set of documents.
: 
:  [java] Total Query Processing time (msec) : 38745
:  [java] Total No. of Documents : 7,500,000
:  [java] Total No. of Executed queries : 50.0
:  [java] Execution time per query : 774.9 msec
: 
:   The index is optimized and its size is 830 MB.
:   Each document has the following terms :
:  VSID(integer), data(float), type(short int) , precision (byte).
:The queries are generate in a loop similar to one below :
:  loop ...
:  RangeFilter rq1 = new
:  RangeFilter(data,+5.4324324344,+5.4324324344true,true);
:  RangeFilter rq2 = new RangeFilter
:  (precision,+0001,+0002,true,true);
:  ChainedFilter cf = new ChainedFilter(new
:  Filter[]{rq2,rq1},ChainedFilter.AND);
:  Query query = qp.parse(type:0);
:  Hits hits = searcher.search(query,cf);
:  end loop
: 
:   I would like to know if there exist any solution to improve the search
:  time ?  (I need to insert more than 500 million of these data pages into
:  lucene)




-Hoss









Re: Lucene Scalability Question

2007-01-10 Thread J. Delgado

This sounds very interesting... I'll definitely have a look into it.
However, I have the feeling that, like the use of Oracle Text, this keeps
separate the underlying data structures used for evaluating full-text
conditions and conditions over other data types, which brings up other
issues when trying to do full-blown mixed queries. Things get worse
when doing joins and other relational algebra operations.

I'm still wondering if the basic data structures should be revised to
achieve better performance...

-- Joaquin

2007/1/10, robert engels [EMAIL PROTECTED]:

There is a module in Lucene contrib that changes that! It loads
Lucene into the Oracle database (it has a JVM), and allows Lucene
syntax to perform full-text searching.

On Jan 10, 2007, at 2:37 PM, J. Delgado wrote:

 No, Oracle Text does not use Lucene. It has its own proprietary
 full-text engine. It represents documents, the inverted index and
 relationships in a DB schema and it depends heavily on the SQL layer.
 This has some severe limitations though...

 Of course, you can push structured data into full-text based indexes.
 We have seen how in Lucene we can represent some structured data types
 (e.g. dates, numbers) as fields and perform some type of mixed queries
 but the Lucene index, as some of you have pointed out, is not meant
 for this and does not scale like a DB would.

 I'm looking to hear new ideas people may have to solve this very
 hard problem.

 -- Joaquin
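
As an aside, a small sketch of the kind of mixed text-plus-numeric query being
described, using current Lucene point fields (which postdate this thread);
field and value names are illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MixedQuerySketch {
    // Index a numeric field alongside text so one query can combine both.
    public static Document productDoc(String sku, String description, long priceCents) {
        Document doc = new Document();
        doc.add(new StringField("sku", sku, Field.Store.YES));
        doc.add(new TextField("description", description, Field.Store.NO));
        doc.add(new LongPoint("price", priceCents));             // indexed for range queries
        return doc;
    }

    // Full-text term AND numeric range in a single Lucene query.
    public static Query laptopsUnderPrice(long maxPriceCents) {
        return new BooleanQuery.Builder()
            .add(new TermQuery(new Term("description", "laptop")), BooleanClause.Occur.MUST)
            .add(LongPoint.newRangeQuery("price", 0L, maxPriceCents), BooleanClause.Occur.MUST)
            .build();
    }
}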

 2007/1/10, robert engels [EMAIL PROTECTED]:
 I think the contrib 'Oracle Full Text' does this (although in the
 reverse).

 It uses Lucene for full text queries (embedded into the db), the
 query analyzer works.

 It is really a great piece of software. Too bad it can't be done in a
 standard way so that it would work with all dbs.

 I think it may be possible to embed Apache Derby to do
 something like this, although this might be overkill. A simple b-tree
 db might work best.

 It would be interesting if the documents could be stored in a btree,
 and a GUID used to access them (since the lucene docid is constantly
 changing). The only stored field in a lucene Document would be the
 GUID.
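
A sketch of that suggestion with current Lucene field classes: the index
stores only the GUID, and the document body is resolved from an external
store (a plain Map here as a stand-in for the b-tree):

import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class GuidIndexing {
    // Index the text for searching, but store nothing except the GUID.
    public static Document toIndexDoc(String guid, String text) {
        Document doc = new Document();
        doc.add(new StringField("guid", guid, Field.Store.YES)); // the only stored field
        doc.add(new TextField("body", text, Field.Store.NO));    // searchable, not stored
        return doc;
    }

    // After a search, resolve each hit's GUID against the external store.
    public static String fetch(Map<String, String> externalStore, String guid) {
        return externalStore.get(guid);                          // b-tree / key-value lookup stand-in
    }
}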

 On Jan 10, 2007, at 2:21 PM, J. Delgado wrote:

  This is a more general question:
 
  Given the fact that most applications require querying a
 combination
  of full-text and structured data has anyone looked into building
 data
  structures at the most fundamental level  (e.g. combination of b-
 tree
  and inverted lists) that would enable scalable and performant
  structured (e.g.SQL or XQuery) + Full-Text queries?
 
  Can Lucene be taken as basis for this or do you recommend exploring
  other routes?
 
  -- Joaquin
 
  2007/1/10, Chris Hostetter [EMAIL PROTECTED]:
 
  : So you mean lucene can't do better than this ?
 
  robert's point is that based on what you've told us, there is no reason
  to think Lucene makes sense for you -- if *all* you are doing is finding
  documents based on numeric ranges, then a relational database is better
  suited to your task.  if you actually care about the textual IR features
  of Lucene, then there are probably ways to make your searches faster,
  but you aren't giving us enough information.

  you said the example code you gave was in a loop ... but a loop over
  what? .. what changes with each iteration of the loop? ... if there are
  RangeFilters that get reused more than once, CachingWrapperFilter can
  come in handy to ensure that work isn't done more often than it needs
  to be.

  it's also not clear whether your query on type:0 is just a placeholder,
  or indicative of what you actually want to do in the long run ... if
  all of your queries are this simple, and all you care about is getting
  a count of things that have type:0 and are in your numeric ranges,
  then don't use the search method at all, just put type:0 in your
  ChainedFilter and call the bits method directly.

  you also haven't given us any information about whether or not you are
  opening a new IndexSearcher/IndexReader every time you execute a query,
  or reusing the same instance -- reuse makes the performance much better
  because it can reuse underlying resources.

  In short: if you state some performance numbers from timing some code,
  and want to know how to make that code faster, you have to actually
  show people *all* of the code for them to be able to help you.
 
 
  :   I still have the search problem I had before, now search
  takes around
  :  750
  :  msecs for a small set of documents.
  : 
  :  [java] Total Query Processing time (msec) : 38745
  :  [java] Total No. of Documents : 7,500,000
  :  [java] Total No. of Executed queries : 50.0
  :  [java] Execution time per query : 774.9 msec
  : 
  :   The index is optimized and its size is 830 MB.
  :   Each document has the following terms :
  :  VSID(integer), data(float), type(short int) , precision
  (byte).
  :The queries are generate in a loop

Job Opportunity (Sunnyvale, CA)

2007-01-09 Thread J. Delgado

(Sorry for the cross-posting)

This is a full-time position with an exciting New Venture (now in
stealth mode) and will be based out of Sunnyvale, CA.

We are looking for a Java Developer with search, social networks and/or
payment processing related experience.

Required Skills:

2+ yrs of industrial experience with search technologies/engines like
Lucene/Nutch/Solr, Oracle, FAST, Endeca, etc., as well as XML and
relational database technologies, and/or development of transactional
payment systems (e.g. PayPal).

- Experience with classification, attribute matching and/or
collaborative filtering
- Some exposure to P2P technologies (transactions, communication and
social networks) is highly desirable.
- Understanding of ontologies/taxonomies, keyword libraries, and other
databases to assist search query interpretation and formulation.
- Prefer MS or Computer Science graduate with specialization in
Information Retrieval or Data Mining.
- Willing to train a junior candidate.
- Must be *hands-on*.
- Ability to work quickly and accurately in a high-volume work environment.
- Excellent analytical skills.
- Creativity, intelligence, and integrity.
- Strong work ethic and a high level of professionalism.
- Hands-on design and development skills in Java and J2EE technologies
- Experience in development of large scale Web Portals is a plus.

If interested, please send the resume with contact info and salary
expectations at the earliest.

Less experienced AJAX/Web 2.0 Java Developers are also welcomed to
submit their resume.

Joaquin Delgado, PhD.
[EMAIL PROTECTED]



