So will this throw an exception on older segments, or will it just fail
to read the correct metadata? I have a lot of older segments I still need
to use.
Thanks for your help.
-Matt Zytaruk
Andrzej Bialecki wrote:
Matt Zytaruk wrote:
Here you go.
java.lang.ClassCastException: java.util.ArrayList
    at org.apache.nutch.parse.ParseData.write(ParseData.java:122)
    at org.apache.nutch.parse.ParseImpl.write(ParseImpl.java:51)
    at org.apache.nutch.fetcher.FetcherOutput.write(FetcherOutput.java:57)
    at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:168)
    at org.apache.nutch.mapred.MapTask$1.collect(MapTask.java:78)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:229)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:123)
Congratulations! You are the first person to actually use (and suffer
from) the multiple values in ContentProperties... ;-)
It turns out that ParseData.write() uses its own method for writing
out metadata, instead of using ContentProperties.write(). It works
well if you only have single values (then they are stored as Strings),
but if there are multiple values they are stored in ArrayLists, which
ParseData accesses directly through metadata.entrySet().iterator() and
then casts to String, hence the ClassCastException.
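To illustrate the failure mode outside of Nutch, here is a minimal stand-alone sketch. It does not use ContentProperties; the plain Map, the class name and the method names are hypothetical stand-ins, and the only assumption is the one described above: single values are stored as Strings and multiple values as ArrayLists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiValueCastDemo {

    // Hypothetical stand-in for the old write loop: it assumes that
    // every value in the map is a String and casts accordingly.
    static String writeAssumingStrings(Map<String, Object> metadata) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Object> e : metadata.entrySet()) {
            out.append(e.getKey()).append('=')
               .append((String) e.getValue()).append(';');
        }
        return out.toString();
    }

    // Returns true if the String-only loop fails with a ClassCastException.
    static boolean failsWithClassCast(Map<String, Object> metadata) {
        try {
            writeAssumingStrings(metadata);
            return false;
        } catch (ClassCastException ex) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Single value, stored as a String: the cast succeeds.
        Map<String, Object> single = new LinkedHashMap<>();
        single.put("Content-Type", "text/html");
        System.out.println(failsWithClassCast(single)); // prints "false"

        // Multiple values, stored as an ArrayList: the cast fails.
        Map<String, Object> multi = new LinkedHashMap<>();
        List<String> cookies = new ArrayList<>();
        cookies.add("a=1");
        cookies.add("b=2");
        multi.put("Set-Cookie", cookies);
        System.out.println(failsWithClassCast(multi)); // prints "true"
    }
}
```

The header names are only example data; any key with more than one value would trigger the same cast failure.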
The fix is easy: please replace the following lines in ParseData.write():
    out.writeInt(metadata.size()); // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
      Map.Entry e = (Map.Entry)i.next();
      UTF8.writeString(out, (String)e.getKey());
      UTF8.writeString(out, (String)e.getValue());
    }
with this:
    metadata.write(out);
Do the same for reading the metadata field. In ParseData.readFields(),
replace this:
    int propertyCount = in.readInt(); // read metadata
    metadata = new ContentProperties();
    for (int i = 0; i < propertyCount; i++) {
      metadata.put(UTF8.readString(in), UTF8.readString(in));
    }
with this:
    metadata = new ContentProperties();
    metadata.readFields(in);
Compile, deploy, test, report ... :-) Please note that this changes
the on-disk segment format, so you won't be able to read the old
segments with the new code. You may want to bump ParseData.VERSION
and keep the old reading code as a fallback for older versions...
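
A rough sketch of what such version handling could look like is below. It is stand-alone Java and deliberately does not use the Nutch classes: the version numbers, the byte layout and all names (VersionedMetadataDemo, OLD_VERSION, NEW_VERSION, and so on) are assumptions made for illustration, not the actual ParseData or ContentProperties format. The idea is only that the reader checks the version first and picks the old single-value layout or the new multi-value layout accordingly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VersionedMetadataDemo {
    // Hypothetical version numbers; the real ParseData.VERSION may differ.
    static final byte OLD_VERSION = 1; // one String value per key
    static final byte NEW_VERSION = 2; // a list of values per key

    // Old layout (assumed): version, entry count, then key/value pairs.
    static void writeOld(DataOutput out, Map<String, String> md) throws IOException {
        out.writeByte(OLD_VERSION);
        out.writeInt(md.size());
        for (Map.Entry<String, String> e : md.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // New layout (assumed): version, entry count, then key, value count, values.
    static void writeNew(DataOutput out, Map<String, List<String>> md) throws IOException {
        out.writeByte(NEW_VERSION);
        out.writeInt(md.size());
        for (Map.Entry<String, List<String>> e : md.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().size());
            for (String v : e.getValue()) {
                out.writeUTF(v);
            }
        }
    }

    // Reads either layout, choosing the branch by the leading version byte.
    static Map<String, List<String>> read(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != OLD_VERSION && version != NEW_VERSION) {
            throw new IOException("unknown version " + version);
        }
        Map<String, List<String>> md = new LinkedHashMap<>();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            // Old records carry exactly one value and no value count.
            int n = (version == OLD_VERSION) ? 1 : in.readInt();
            List<String> values = new ArrayList<>(n);
            for (int j = 0; j < n; j++) {
                values.add(in.readUTF());
            }
            md.put(key, values);
        }
        return md;
    }

    // Convenience wrappers around in-memory streams.
    static byte[] oldBytes(Map<String, String> md) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeOld(new DataOutputStream(buf), md);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] newBytes(Map<String, List<String>> md) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeNew(new DataOutputStream(buf), md);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Map<String, List<String>> readBytes(byte[] data) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> old = new LinkedHashMap<>();
        old.put("Content-Type", "text/html");
        System.out.println(readBytes(oldBytes(old)));

        Map<String, List<String>> multi = new LinkedHashMap<>();
        multi.put("Set-Cookie", Arrays.asList("a=1", "b=2"));
        System.out.println(readBytes(newBytes(multi)));
    }
}
```

In this sketch, records written in the old layout come back as one-element lists, so the calling code sees a single shape regardless of which version wrote the data.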