The parse-mp3 plugin seems to keep the text content from previous
parses. For every new mp3 file parsed, it puts the text of all the
previously parsed files into the plain text field for that file.

You can see this by fetching a set of mp3s in one segment, then
viewing their plain text in the Nutch webapp. The plain text will
include the contents of the other files fetched in that round, which
makes searching fruitless.

I made a tiny band-aid change to MP3Parser.java and
MetadataCollector.java against the nightly: MP3Parser now calls a new
clearText() method on the collector before parsing each file. It seems
to fix the problem.


--- MP3Parser.java      2006-12-10 09:43:26.000000000 -0500
+++ MP3Parser.java.new  2006-12-10 16:37:03.000000000 -0500
@@ -67,7 +67,7 @@
        fos.write(raw);
        fos.close();
        MP3File mp3 = new MP3File(tmp);
-
+         metadataCollector.clearText();
        if (mp3.hasID3v2Tag()) {
          parse = getID3v2Parse(mp3, content.getMetadata());
        } else if (mp3.hasID3v1Tag()) {

--- MetadataCollector.java      2006-12-10 09:43:26.000000000 -0500
+++ MetadataCollector.java.new  2006-12-10 16:37:28.000000000 -0500
@@ -42,6 +42,10 @@
        this.conf = conf;
    }

+  public void clearText() {
+       text = "";
+  }
+
    public void notifyProperty(String name, String value) throws MalformedURLException {
      if (name.equals("TIT2-Text"))
        setTitle(value);
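
For anyone who wants to see the effect in isolation, here is a small
standalone sketch of what I think is going on. It is not the real plugin
code: the Collector class, parseAll() and the sample tag values are all
hypothetical stand-ins, and it assumes that one collector instance is
reused for every file and appends each value to a text field.

```java
import java.util.ArrayList;
import java.util.List;

class StaleTextDemo {

    // Hypothetical stand-in for MetadataCollector (assumption: the real
    // class appends every property value to a single text field).
    static class Collector {
        private String text = "";

        void notifyProperty(String name, String value) {
            text += value + "\n";
        }

        // Same idea as the clearText() added in the patch above.
        void clearText() {
            text = "";
        }

        String getText() {
            return text;
        }
    }

    // "Parses" each file (here just an array of tag values) with one shared
    // collector and returns the plain text recorded for each file.
    static List<String> parseAll(String[][] files, boolean clearBetweenFiles) {
        Collector collector = new Collector();
        List<String> plainTexts = new ArrayList<>();
        for (String[] tagValues : files) {
            if (clearBetweenFiles) {
                collector.clearText();
            }
            for (String value : tagValues) {
                collector.notifyProperty("tag", value);
            }
            plainTexts.add(collector.getText());
        }
        return plainTexts;
    }

    public static void main(String[] args) {
        String[][] files = { {"Song A", "Artist A"}, {"Song B", "Artist B"} };
        System.out.println("without clearText(): " + parseAll(files, false));
        System.out.println("with clearText():    " + parseAll(files, true));
    }
}
```

Without the reset, the text for the second file also carries the values
from the first one; with the reset, each file only gets its own values.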







_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
