Can you specify which parts you consider to be corrupted? The OL subject treatment does break up the LCSH subfields into separate subject terms (somewhat like FAST [1]), so a list of separate terms is expected. It looks to me like the date subfield "(2nd :1962-1965)" was parsed into two parts, "(2nd :" and "1962-1965)", and of course the punctuation in the subfield is causing problems. The other issue I see is that the subfield ending with "religiosa" got truncated to "religi". Are there other problems that you see?
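To make the comparison concrete, here is a rough Python sketch that splits the isbndb heading on its " -- " delimiters and checks each OL term against the resulting subfields. The function names are hypothetical and the sample values are copied from the message quoted below; this is only an illustration of the check, not how Open Library parses subjects.

```python
# Sample values copied from the quoted message; helper names are hypothetical.
ISBNDB_HEADING = ("Vatican Council -- (2nd :1962-1965). -- "
                  "Declaratio de libertate religiosa -- Congresses")
OL_SUBJECTS = ["1962-1965)", "Congresses", "Declaratio de libertate religi",
               "(2nd :", "Vatican Council"]


def split_heading(heading):
    """Split a '--'-delimited heading into subfields, dropping trailing periods."""
    parts = (part.strip().rstrip(".") for part in heading.split("--"))
    return [part for part in parts if part]


def classify(ol_terms, subfields):
    """Label each OL term as 'intact', 'fragment' (piece of a subfield), or 'unknown'."""
    report = {}
    for term in ol_terms:
        if term in subfields:
            report[term] = "intact"
        elif any(term in subfield for subfield in subfields):
            report[term] = "fragment"
        else:
            report[term] = "unknown"
    return report


if __name__ == "__main__":
    subfields = split_heading(ISBNDB_HEADING)
    for term, status in classify(OL_SUBJECTS, subfields).items():
        print(f"{status:8}  {term!r}")
```

With this sample, "Vatican Council" and "Congresses" come out intact, while the two date pieces and the "religi" term are flagged as fragments of a longer subfield.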
kc

On 5/2/12 6:49 AM, Sujoy Ghosh wrote:
> Hi,
>
> We have downloaded the open library edition data dump from following link
> http://openlibrary.org/data/ol_dump_authors_latest.txt.gz
>
> While parsing the data, we found the subjects fields are corrupted for
> many editions. eg.
>
> For /books/OL7974826M (isbn=0809119935), the subjects filed is given
> following value
>
> "subjects": ["1962-1965)", "Congresses", "Declaratio de libertate
> religi", "(2nd :", "Vatican Council"]
>
> By using isbndb api I got below subjects data
>
> <Subjects>
> <Subject
> subject_id="vatican_council_2nd_1962_1965_declaratio_de_libertate_religi">
> Vatican Council -- (2nd :1962-1965). -- Declaratio de libertate
> religiosa -- Congresses
> </Subject>
> <Subject subject_id="freedom_of_religion_congresses">Freedom of religion
> -- Congresses</Subject>
> </Subjects>
>
> It's clearly visible that the open library data is corrupted in the
> subjects filed. This is observed in so many other editions also in the dump.
>
> Can you please help us to find out the correct data? Can you suggest any
> solution?
>
> Rgds,
> Sujoy
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
> [email protected]

--
Karen Coyle
[email protected]
http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
