Hi everybody!

When we started the MG4J Project, we did not want to make anything like Lucene, but rather, as the project name suggests, to produce a Java version of the MG (Managing Gigabytes) project of Moffat et al.
In the first stage, we decided to focus on indexing rather than document compression, because that was our primary need (we wanted to use MG4J in the context of our other projects on Web crawling, querying and compression, http://webgraph.dsi.unimi.it/ and http://ubi.imc.pi.cnr.it/projects/ubicrawler/). The idea was to have a Java library to create and access inverted indices. To do that, we also needed efficient bit-level manipulation classes and raw variable-length encoding of integers. The library was aimed not at end users but at developers, even though we had (and still have) in mind that some tools to make it easy to use should eventually be provided.

Otis is only partially right: the old version of MG4J did not contain any way to search the index, but the current release (http://mg4j.dsi.unimi.it/) has some basic search capabilities, and you can query the index with general boolean expressions (OR, AND, NOT, full-phrase etc.). We also made the overall structure a lot more flexible and easier to use. Most of the new features are still experimental and only partially documented: we plan to give a full account of the new stuff in the next few weeks/months, both by providing more documentation and by writing research papers about some features that are new in the field.

Still, MG4J has different aims than Lucene, and so the two projects are not directly comparable:

- MG4J assumes that you provide documents in the very rough form of word sequences: you should do the tokenization/parsing by yourself
- MG4J has no concept of "field", but the new version introduces a (much more rudimentary) notion of different indices built over the same document collection (like, for example, a mailbox indexed by subject, author, content etc.)
- on the other hand, MG4J puts much emphasis on the use of state-of-the-art compression and querying techniques (the new version contains experimental classes to produce indices with multilevel skip lists, lazy search and semantically sound multi-index queries), so you can usually expect smaller indices and faster searches.

Bye
Paolo Boldi

> Hi Anson,
>
> It's not quite correct to compare MG4J and Lucene directly. Lucene
> is a toolkit whose primary goal is to let you create an index and
> search it, while MG4J is really a library of Java classes that people
> implementing an IR library (such as Lucene, for example) may find
> useful. You cannot create a searchable index with MG4J alone.
>
> Otis
>
>
> --- Anson Lau <[EMAIL PROTECTED]> wrote:
>
> > Hi All,
> >
> > Has anyone seen the project MG4J (Managing Gigabytes for Java)
> > http://mg4j.dsi.unimi.it/ ? Does anybody know enough about both Lucene
> > and MG4J to comment on how the two compare?
> >
> > Thanks,
> >
> > Anson