Doron - this looks super useful!
Can you give an example of the lexical affinities you mention here? ("Juru
creates posting lists for lexical affinities")
Also:
"Normalized term-frequency, as in Juru.
Here, tf(freq) is normalized by the average term frequency of the document."
I've never seen this mentioned anywhere except here and once before on the ML
(was it you who mentioned it?), but it sounds intuitive. What do others
think?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Apache Wiki <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Wednesday, January 30, 2008 5:15:02 PM
Subject: [Lucene-java Wiki] Update of "TREC 2007 Million Queries Track - IBM
Haifa Team" by DoronCohen
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by DoronCohen:
http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team

The comment on the change is:
Initial version, some data still missing...

New page:
= TREC 2007 Million Queries Track - IBM Haifa Team =

The [http://ciir.cs.umass.edu/research/million/ Million Queries Track] ran for the first time in 2007. Quoting from the track home page:
 * "The goal of this track is to run a retrieval task similar to standard ad-hoc retrieval, but to evaluate large numbers of queries incompletely, rather than a small number more completely. Participants will run 10,000 queries and a random 1,000 or so will be evaluated. The corpus is the terabyte track's GOV2 corpus of roughly 25,000,000 .gov web pages, amounting to just under half a terabyte of data."
We participated in this track with two search engines: Lucene, and our home-brewed search engine [http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf Juru]. The official reports and papers of the track should be available sometime in February 2008, but here is a summary of the results and of our experience with our first ever Lucene submission to TREC.
In summary, the out-of-the-box search quality was not so great, but by altering how we use Lucene (that is, our application) and with some modifications to Lucene, we were able to improve the search quality and to score well in this competition. The lessons we learned can be of interest to applications using Lucene, to Lucene itself, and to researchers submitting to other TREC tracks (or elsewhere).
= Training =

As preparation for the track runs we "trained" Lucene on queries from previous years' tracks - more exactly, on the 150 short TREC queries for which there are existing judgments from previous years, for the same GOV2 data. We built an index - actually 27 indexes - for this data. For indexing we used the Trec-Doc-Maker that is now in Lucene's contrib benchmark (or a slight modification of it). We found that the best results are obtained when all data is in a single field, and so we did that, keeping only stems (English, Porter, from Lucene contrib). We used the Standard-Analyzer, with a stoplist modified to take domain-specific stopwords into account.
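
As an illustration only, here is roughly what such an analyzer could look like with the Lucene 2.3-era API (the version current at the time). The class name and stoplist handling are my assumptions, and the HTML parsing (done with Juru's parser) is not shown:

{{{
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Rough sketch of the analysis described above: standard tokenization,
 * a custom (domain-specific) stoplist, and Porter stemming so that only
 * stems are kept in the single indexed field.
 */
public class StemmingStopAnalyzer extends Analyzer {
  private final Set stopWords;

  public StemmingStopAnalyzer(String[] stopWords) {
    this.stopWords = StopFilter.makeStopSet(stopWords);
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(reader);
    ts = new StandardFilter(ts);
    ts = new LowerCaseFilter(ts);
    ts = new StopFilter(ts, stopWords);   // modified, domain-specific stoplist
    ts = new PorterStemFilter(ts);        // keep only stems
    return ts;
  }
}
}}}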
Running with both Juru and Lucene, and having obtained good results with Juru in previous years, we had something to compare to. For this, we made sure to HTML-parse the documents in the same way in both systems (we used Juru's HTML parser for this) and to use the same stoplist, etc.

In addition, anchor text was collected in a pre-indexing global analysis pass, so that the anchors of (pointing to) a page were indexed with the page they point to, up to a limited size. The number of in-links to each page was saved in a stored field and we used it as a static score element (boosting documents that had more in-links). The way that anchor text was extracted and prepared for indexing will be described in the full report.
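
For illustration, a minimal sketch of the static-score idea against the Lucene 2.3-era API; the field name and the boost formula here are made-up examples, not what was actually used in the runs:

{{{
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** Store the in-link count with the page and boost pages with more in-links. */
public class InLinkBoost {
  public static void addInLinkInfo(Document doc, int inLinks) {
    // keep the raw count in a stored (not indexed) field
    doc.add(new Field("inlinks", Integer.toString(inLinks),
                      Field.Store.YES, Field.Index.NO));
    // static, query-independent score element: more in-links => higher boost
    doc.setBoost((float) (1.0 + Math.log(1.0 + inLinks)));
  }
}
}}}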
= Results =

The initial results were:

||<rowbgcolor="#80FF80">'''Run'''||'''MAP'''||'''[EMAIL PROTECTED]'''||'''[EMAIL PROTECTED]'''||'''[EMAIL PROTECTED]'''||
|| 1. Juru || 0.313 || 0.592 || 0.560 || 0.529 ||
|| 2. Lucene out-of-the-box || 0.154 || 0.313 || 0.303 || 0.289 ||
We made the following changes:
 1. Add a proximity scoring element, based on our experience with "lexical affinities" in Juru. Juru creates posting lists for lexical affinities. In Lucene we augmented the query with Span-Near-Queries (a sketch of the query augmentation appears after this list).
 1. Phrase expansion - the query text was added to the query as a phrase.
 1. Replace the default similarity with Sweet-Spot-Similarity for a better choice of document length normalization (a similarity sketch also appears after this list). Juru uses [http://citeseer.ist.psu.edu/singhal96pivoted.html pivoted length normalization] and we experimented with it, but found that the simpler and faster Sweet-Spot-Similarity performs better.
 1. Normalized term frequency, as in Juru. Here, tf(freq) is normalized by the average term frequency of the document.
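
A minimal sketch, against the Lucene 2.3-era API, of how changes 1 and 2 above might look: the keyword query is augmented with a Span-Near-Query per pair of query words (a rough stand-in for Juru's lexical affinities) and with the full query text as a phrase. The field name, slop, and boost are illustrative choices only, not the values from the actual runs:

{{{
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class QueryAugmenter {
  public static BooleanQuery augment(String field, String[] queryWords) {
    BooleanQuery bq = new BooleanQuery();

    // plain keyword clauses
    for (int i = 0; i < queryWords.length; i++) {
      bq.add(new TermQuery(new Term(field, queryWords[i])),
             BooleanClause.Occur.SHOULD);
    }

    // proximity element: one span-near clause per pair of query words,
    // in any order, within a small window
    for (int i = 0; i < queryWords.length; i++) {
      for (int j = i + 1; j < queryWords.length; j++) {
        SpanQuery[] pair = {
          new SpanTermQuery(new Term(field, queryWords[i])),
          new SpanTermQuery(new Term(field, queryWords[j]))
        };
        bq.add(new SpanNearQuery(pair, 5, false), BooleanClause.Occur.SHOULD);
      }
    }

    // phrase expansion: the query text added as a phrase
    if (queryWords.length > 1) {
      PhraseQuery phrase = new PhraseQuery();
      for (int i = 0; i < queryWords.length; i++) {
        phrase.add(new Term(field, queryWords[i]));
      }
      phrase.setBoost(0.5f);  // illustrative weight only
      bq.add(phrase, BooleanClause.Occur.SHOULD);
    }
    return bq;
  }
}
}}}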
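
And a minimal sketch of change 3 (same API era): plugging the contrib Sweet-Spot-Similarity in at search time. Change 4, normalizing tf(freq) by the document's average term frequency, is only described in the comments, since stock Similarity.tf(float) does not see per-document statistics - that is part of the "modifications to Lucene" mentioned above and is not shown here. The index path is a placeholder:

{{{
import org.apache.lucene.misc.SweetSpotSimilarity;
import org.apache.lucene.search.IndexSearcher;

public class SweetSpotSetup {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/path/to/index");

    // change 3: sweet-spot document length normalization instead of the
    // default 1/sqrt(numTerms)
    searcher.setSimilarity(new SweetSpotSimilarity());

    // change 4 (conceptually): tf'(freq) ~ freq / avgTf(doc), where
    // avgTf(doc) is the average term frequency of the document; this needs
    // per-document statistics that plain Similarity.tf(float) cannot access.

    // ... run queries as usual ...
    searcher.close();
  }
}
}}}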
So these are the updated results:

||<rowbgcolor="#80FF80">'''Run'''||'''MAP'''||'''[EMAIL PROTECTED]'''||'''[EMAIL PROTECTED]'''||'''[EMAIL PROTECTED]'''||
|| 1. Juru || 0.313 || 0.592 || 0.560 || 0.529 ||
|| 2. Lucene out-of-the-box || 0.154 || 0.313 || 0.303 || 0.289 ||
|| 3. Lucene + LA + Phrase + Sweet Spot + tf-norm || 0.306 || 0.627 || 0.589 || 0.543 ||
The improvement is dramatic. Perhaps even more important, once the track results were published, we found that these improvements are consistent and steady, and so Lucene with these changes was also ranked high by the two new measures introduced in this track - NEU-Map and E-Map (Epsilon-Map). With these new measures more queries are evaluated, but fewer documents are judged for each query. The algorithms for selecting documents for judging (during the evaluation stage of the track) were not our focus in this work, as there were actually two goals to this TREC:
 * the systems evaluation (our main goal), and
 * the evaluation methodology itself.

The fact that the modified Lucene scored well on both the traditional 150 queries and the new 1700 evaluated queries with the new measures was reassuring for the "usefulness", or perhaps "validity", of these modifications to Lucene. Certainly these changes are not a 100% fit for every application and every data set, but these results are strong, and so I believe they can be valuable for many applications, and certainly for research purposes.
= Search time penalty =

These improvements did not come for free. Adding a phrase to the query and adding Span-Near-Queries for every pair of query words costs query time. The search time of stock Lucene in our setup was 1.4 seconds/query; the modified search took 8.0 seconds/query. This is a large slowdown! But it should be noted that in this work we did not focus on search time, only on quality. Now is the time to see how the search time penalty can be reduced while keeping most of the search quality improvements.
= Implementation Details =

 * The contrib benchmark quality package was used for the search quality measures and for producing the submissions; a rough sketch of driving it follows.
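
For illustration, roughly how that package can be driven. The class names are from the contrib benchmark quality package, but the exact constructor signatures shown here are from memory of the 2.3-era code, and the topic/qrels/index/field names are placeholders:

{{{
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import org.apache.lucene.benchmark.quality.Judge;
import org.apache.lucene.benchmark.quality.QualityBenchmark;
import org.apache.lucene.benchmark.quality.QualityQuery;
import org.apache.lucene.benchmark.quality.QualityStats;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
import org.apache.lucene.search.IndexSearcher;

public class QualityRun {
  public static void main(String[] args) throws Exception {
    PrintWriter log = new PrintWriter(System.out, true);

    // read TREC topics and judgments (qrels)
    TrecTopicsReader topicsReader = new TrecTopicsReader();
    QualityQuery[] queries =
        topicsReader.readQueries(new BufferedReader(new FileReader("topics.txt")));
    Judge judge = new TrecJudge(new BufferedReader(new FileReader("qrels.txt")));
    judge.validateData(queries, log);

    // run the queries against the index, write a submission, collect stats
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    QualityBenchmark benchmark = new QualityBenchmark(
        queries, new SimpleQQParser("title", "body"), searcher, "docname");
    QualityStats[] stats = benchmark.execute(
        judge, new SubmissionReport(new PrintWriter("submission.txt"), "lucene-run"), log);

    // average over all queries (MAP, precision at cutoffs, etc.)
    QualityStats.average(stats).log("SUMMARY", 2, log, "  ");
    searcher.close();
  }
}
}}}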
/!\ To be completed...
= More Detailed Results =

/!\ To be added...
= Possible Changes in Lucene =

 * Move Sweet-Spot-Similarity to core
 * Make Sweet-Spot-Similarity the default similarity?
 * Easier and more efficient ways to add proximity scoring?
 * Allow easier implementation/extension of tf-normalization

/!\ To be completed & refined...
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]