Hi all, I prepared a special edition of Luke, the Lucene Index Toolbox, that works with Lucene indexes located on any filesystem supported by Hadoop 0.19.1.
At the moment I'm looking for feedback on how best to integrate this functionality with the various bits and pieces of Luke.

You can download the jar file from a direct link:

http://www.getopt.org/luke/lukeall-0.9.3.jar

This JAR contains all dependencies needed to connect to HDFS, KFS or S3/S3n filesystems, although so far I have tested it only with HDFS.

Note: this version of Luke still uses Lucene 2.4.1; I haven't started integrating 2.9-dev yet.

Quick info for the impatient: yes, you can browse the content, view terms and documents, run searches, explain results, etc. See below for more details.

The initial Open dialog is not yet integrated with this functionality. After you start Luke, you need to dismiss this dialog, go to Plugins / Hadoop Plugin, enter the full URI of the index in the text field, and then press the Open button. There is no filesystem browsing for now - you need to know the full URI in advance.

Current functionality is as follows:

- you can open a single index or partial (sharded) indexes located in part-NNNNN/ subdirectories (this is the typical layout produced by common map-reduce output formats). In the latter case you get a single view of the partial indexes, thanks to MultiReader.
- access is read-only - most FileSystem implementations don't support file updates, so it was easiest to disable write access altogether for now.
- most Luke functionality works properly, thanks to the excellent design of the IndexReader API. Some operations are disabled because access is read-only, and some information (like top terms) is not populated by default because of its high IO cost, but it can be requested explicitly.
- the plugin keeps track of the amount of IO reads - I found this very comforting when opening large indexes over a slow VPN line ... There is a "Clear" button on the plugin's tab that resets the counters - this is useful for seeing how much IO is needed to complete a specific operation.
- a lot of code has been reworked to avoid UI stalls when doing slow IO, which means that you can see the amount of IO being done while an operation runs, but the UI is blocked by a modal dialog in the meantime. It's a bit unwieldy, but other solutions would require too much refactoring.

Any feedback is welcome - please keep in mind that this is an early preview. Also, various UI glitches are probably related to the Thinlet toolkit - one day I may rewrite Luke using something else, but for now I don't have the strength to do it. :)

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
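
P.S. For anyone wondering how per-operation IO accounting of the kind described above can be done, here is a minimal Java sketch. It is not the plugin's code: IoCounter and CountingInputStream are names made up for this illustration, and it wraps a plain java.io.InputStream rather than Lucene's or Hadoop's own input classes. The idea is simply to count every successful read and to offer a clear() method that zeroes the totals, so that the numbers seen afterwards belong to one specific operation.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counter shared by every stream opened for one index. */
final class IoCounter {
    private final AtomicLong bytes = new AtomicLong();
    private final AtomicLong calls = new AtomicLong();

    /** Records one successful read call that returned n bytes. */
    void record(long n) {
        calls.incrementAndGet();
        bytes.addAndGet(n);
    }

    long bytesRead() { return bytes.get(); }

    long readCalls() { return calls.get(); }

    /** Zeroes both totals (the role a "Clear" button would play). */
    void clear() {
        bytes.set(0);
        calls.set(0);
    }
}

/** Wraps any InputStream and reports each successful read to the counter. */
final class CountingInputStream extends FilterInputStream {
    private final IoCounter counter;

    CountingInputStream(InputStream in, IoCounter counter) {
        super(in);
        this.counter = counter;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            counter.record(1); // one byte delivered
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            counter.record(n); // end-of-stream (-1) is not counted
        }
        return n;
    }
}

/** Small demo: an in-memory array stands in for a slow remote file. */
class IoCounterDemo {
    public static void main(String[] args) throws IOException {
        IoCounter counter = new IoCounter();
        try (InputStream in = new CountingInputStream(
                new ByteArrayInputStream(new byte[1000]), counter)) {
            in.read(new byte[256]); // first "operation"
            System.out.println(counter.bytesRead() + " bytes, "
                    + counter.readCalls() + " read call(s)");

            counter.clear();        // reset before measuring the next one
            in.read(new byte[100]); // second "operation"
            System.out.println(counter.bytesRead() + " bytes, "
                    + counter.readCalls() + " read call(s)");
        }
    }
}
```

Keeping the counter separate from the stream means many streams (for example, one per index file) can report into the same totals. Note that bytes passed over with skip() are not counted in this sketch.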