mg4j - Managing Gigabyte for Java

Paolo Boldi Fri, 17 Sep 2004 03:15:24 -0700

Hi everybody!

When we started the MG4J project, we did not want to build anything like
Lucene; rather, as the project name suggests, we wanted to produce a Java
version of the MG (Managing Gigabytes) project of Moffat et al.

In the first stage, we decided to focus on indexing rather than document
compression, because that was our primary need (we wanted to use MG4J in
the context of our other projects on Web crawling, querying and
compression, http://webgraph.dsi.unimi.it/ and
http://ubi.imc.pi.cnr.it/projects/ubicrawler/).

The idea was to have a Java library to create and access inverted indices.
To do that, we also needed efficient bit-level manipulation classes and raw
variable-length encoding of integers. The library was aimed at developers
rather than end users, even though we had (and still have) in mind that
some tools to make it easy to use should eventually be provided.
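
As a rough illustration of what bit-level variable-length encoding of
integers involves, here is a minimal sketch of the Elias gamma code, one
classic code of this kind. It is not MG4J code: the class GammaSketch and
its methods are hypothetical names made up for this example, and a
java.util.BitSet stands in for a real bit stream.

```java
import java.util.BitSet;

// Hypothetical illustration, not part of MG4J: Elias gamma coding of
// positive integers on top of a BitSet used as a growable bit buffer.
final class GammaSketch {
    private final BitSet bits = new BitSet();
    private int length = 0; // number of bits written so far

    private void writeBit(boolean b) {
        if (b) bits.set(length);
        length++;
    }

    // Writes x >= 1 as n zeros followed by the (n + 1)-bit binary form of x,
    // where n = floor(log2(x)). Small values therefore take few bits.
    void writeGamma(int x) {
        if (x < 1) throw new IllegalArgumentException("gamma code needs x >= 1");
        int n = 31 - Integer.numberOfLeadingZeros(x);
        for (int i = 0; i < n; i++) writeBit(false);
        for (int i = n; i >= 0; i--) writeBit(((x >>> i) & 1) != 0);
    }

    int length() {
        return length;
    }

    // Reads one gamma-coded integer starting at bit pos[0] and advances pos[0].
    int readGamma(int[] pos) {
        int n = 0;
        while (!bits.get(pos[0])) { // count the leading zeros
            n++;
            pos[0]++;
            if (pos[0] >= length) throw new IllegalStateException("truncated code");
        }
        int x = 0;
        for (int i = 0; i <= n; i++) { // read the n + 1 binary digits
            x = (x << 1) | (bits.get(pos[0]) ? 1 : 0);
            pos[0]++;
        }
        return x;
    }
}
```

With this code the values 1, 5 and 13 take 1, 5 and 7 bits respectively,
instead of 32 bits each as fixed-width ints.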

Otis is only partially right: the old version of MG4J did not contain any
way to search the index, but the current release
(http://mg4j.dsi.unimi.it/) has some basic search capabilities, and you can
query the index with general boolean expressions (OR, AND, NOT, full
phrases, etc.). We also made the overall structure a lot more flexible and
easier to use. Most of the new features are still experimental and only
partially documented: we plan to give a full account of them in the next
few weeks or months, both by providing more documentation and by writing
research papers about the features that are new in the field.
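
To make the idea of boolean querying concrete, the sketch below evaluates
AND, OR and AND NOT over posting lists held as sorted arrays of document
numbers. This only illustrates the general technique and says nothing
about how MG4J implements it; BooleanPostingsSketch and its method names
are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration, not MG4J's API: boolean operators over
// posting lists represented as strictly increasing arrays of document ids.
final class BooleanPostingsSketch {

    // Documents that appear in both lists (AND).
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }

    // Documents that appear in at least one list (OR), without duplicates.
    static int[] or(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length || j < b.length) {
            if (j == b.length || (i < a.length && a[i] < b[j])) out[k++] = a[i++];
            else if (i == a.length || b[j] < a[i]) out[k++] = b[j++];
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }

    // Documents in a that do not appear in b (a AND NOT b).
    static int[] andNot(int[] a, int[] b) {
        int[] out = new int[a.length];
        int j = 0, k = 0;
        for (int doc : a) {
            while (j < b.length && b[j] < doc) j++;
            if (j == b.length || b[j] != doc) out[k++] = doc;
        }
        return Arrays.copyOf(out, k);
    }
}
```

Because both inputs are sorted, each operator is a single merge-style pass,
so the cost is linear in the total length of the two lists.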

Still, MG4J has different aims from Lucene, so the two projects are not
directly comparable:

- MG4J assumes that you provide documents in the very rough form of word
  sequences: you should do the tokenization/parsing yourself.
- MG4J has no concept of "field", but the new version introduces a (much
  more rudimentary) notion of different indices built over the same
  document collection (for example, a mailbox indexed by subject, author,
  content, etc.).
- On the other hand, MG4J puts much emphasis on the use of state-of-the-art
  compression and querying techniques (the new version contains
  experimental classes to produce indices with multilevel skip lists, lazy
  search and semantically sound multi-index queries), so you can usually
  expect smaller indices and faster searches.
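
To give a feeling for why skipping speeds up searches, here is a minimal
single-level sketch (the multilevel structures mentioned above add coarser
skip levels on top of the same idea). It is a simplification under stated
assumptions: SkipSketch is a hypothetical class, and a plain sorted array
stands in for a compressed posting list, where skip pointers would
actually be stored inside the list itself.

```java
// Hypothetical illustration, not MG4J code: a cursor over a sorted list of
// document ids that can jump ahead "step" entries at a time.
final class SkipSketch {
    private final int[] docs; // strictly increasing document ids
    private final int step;   // distance covered by one skip
    private int pos = 0;      // current cursor position

    SkipSketch(int[] docs, int step) {
        if (step < 1) throw new IllegalArgumentException("step must be >= 1");
        this.docs = docs;
        this.step = step;
    }

    // Moves the cursor to the first document id >= target and returns it,
    // or returns -1 if no such document exists.
    int advanceTo(int target) {
        // Take skips while the entry one skip ahead is still below the target.
        while (pos + step < docs.length && docs[pos + step] < target) pos += step;
        // Finish with a short linear scan of at most "step" entries.
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos < docs.length ? docs[pos] : -1;
    }
}
```

During an AND query, the cursor on the longer list can be advanced
directly to each candidate from the shorter list, so long runs of
non-matching documents are never examined one by one.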

Bye

                                Paolo Boldi

> Hi Anson,
> 
> It's not quite correct to compare MG4J and Lucene directly.  Lucene
> is a toolkit whose primary goal is to let you create an index and
> search it, while MG4J is really a library of Java classes that people
> implementing an IR library (such as Lucene, for example) may find
> useful.  You cannot create a searchable index with MG4J alone.
> 
> Otis
> 
> 
> --- Anson Lau <[EMAIL PROTECTED]> wrote:
> 
> > Hi All,
> > 
> > Has anyone seen the project MG4J (Managing Gigabyte for Java)
> > http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both
> > Lucene and MG4J to comment on how the two compare?
> > 
> > Thanks,
> > 
> > Anson
> > 
