On 2010-05-09 12:23, Rafael Kubina wrote:
> Hi
>
> I'm trying to do a full-text search on my Java sources (.java) via Nutch
> (1.0), svn and http (mod_dav_svn).
>
> Other documents like HTML are searchable; my sources are not.
>
> Currently the output is the following:
>
> fetching http://s025/svn/java/foo/trunk/src/main/java/Bar.java
> Pre-configured credentials with scope - host: s025; port: 80; found for
> url: http://s025/svn/java/foo/trunk/src/main/java/Bar.java
> url: http://s025/svn/java/foo/trunk/src/main/java/Bar.java; status
> code: 200; bytes received: 5829; Content-Length: 5829
>
> The content-type for this file is text/plain.
>
> There are no exceptions, no other problems.
>
> I really appreciate any help that I can get. Thanks a lot!

You need to check the following:

* parse_text in your segment (you can dump this with the readseg command).
  It should contain the plain-text content of your file.

* use Luke (www.getopt.org/luke) to examine your Lucene index. You should
  be able to retrieve terms coming from your Java documents - use
  Reconstruct & Edit in Luke.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
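
As a rough illustration of the first check (parse_text in the segment), the sketch below scans the text that readseg writes when dumping a segment and reports the ParseText stored for each URL. It is only a sketch under assumptions: the record layout (a "Recno::" line, a "URL::" line, then a "ParseText::" section) is how I recall the dump looking and may differ between Nutch versions, and the function name parse_text_by_url is made up for this example.

```python
import re
from typing import Dict


def parse_text_by_url(dump: str) -> Dict[str, str]:
    """Map each URL in a readseg text dump to its ParseText body.

    Assumed layout (not verified against every Nutch version):
    records start with "Recno:: <n>", contain a "URL:: <url>" line,
    and may contain a "ParseText::" section that runs to the end
    of the record.
    """
    result: Dict[str, str] = {}
    # Split the dump into records on the assumed "Recno:: <n>" marker lines.
    for record in re.split(r"^Recno:: \d+[ \t]*$", dump, flags=re.MULTILINE):
        url_match = re.search(r"^URL:: (\S+)", record, flags=re.MULTILINE)
        if not url_match:
            continue  # text before the first record, or a malformed record
        text_match = re.search(
            r"^ParseText::[ \t]*\n(.*)", record, flags=re.MULTILINE | re.DOTALL
        )
        # An empty string means no parsed text was found for this URL.
        result[url_match.group(1)] = text_match.group(1).strip() if text_match else ""
    return result


if __name__ == "__main__":
    # Hypothetical dump excerpt, shaped after the assumed layout above.
    sample = (
        "Recno:: 0\n"
        "URL:: http://s025/svn/java/foo/trunk/src/main/java/Bar.java\n"
        "\n"
        "ParseText::\n"
        "public class Bar {}\n"
        "\n"
    )
    for url, text in parse_text_by_url(sample).items():
        print(url, "->", len(text), "characters of parsed text")
```

If the entry for a .java URL comes back empty, the file was probably fetched but not parsed into text, which would explain why it is missing from the index. To run this on real data, read the dump file produced by readseg into a string and pass it to the function.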