Hi Sujoy,

The subjects are not shown on the website <http://openlibrary.org/books/OL7974826M/Religious_Freedom_1965_and_1975>, but you can see them in the JSON view <http://openlibrary.org/books/OL7974826M.json>, so the data is not just "corrupted" in the data dump: the live record contains the same values. My first thought is that the original subject was split incorrectly. Looking further into the source of the data, it seems that the record originally came from Amazon.com: <http://www.amazon.com/gp/product/0809119935>. The subjects listed on the Amazon page are the same strange subjects, suggesting it's Amazon's "fault".
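To illustrate what an incorrectly split subject looks like to software, here is a minimal Python sketch. It assumes a simple heuristic of my own (not anything Open Library itself does): a subject with unbalanced parentheses or a trailing colon is probably a fragment of a longer heading. The function name looks_like_fragment is made up for this example; the subject list is the one from this record.

```python
def looks_like_fragment(subject: str) -> bool:
    """Guess whether a subject string is a piece of a split heading.

    Assumed heuristic: unbalanced parentheses or a dangling colon.
    """
    text = subject.strip()
    unbalanced = text.count("(") != text.count(")")
    dangling_colon = text.endswith(":")
    return unbalanced or dangling_colon


# Subjects as they appear in the record for /books/OL7974826M.
subjects = [
    "1962-1965)",
    "Congresses",
    "Declaratio de libertate religi",
    "(2nd :",
    "Vatican Council",
]

suspect = [s for s in subjects if looks_like_fragment(s)]
clean = [s for s in subjects if not looks_like_fragment(s)]

print("suspect:", suspect)
print("clean:", clean)
```

For this record the heuristic flags "1962-1965)" and "(2nd :". A truncated term such as "Declaratio de libertate religi" passes unnoticed, so this only catches the most obvious breakage.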
The terms in the second subject element from the ISBN DB(?) are on the Amazon page too, but not in the OL record. That is strange; they were probably lost during import. Perhaps the import process has improved since this record was imported - I don't know.

Unfortunately, Open Library contains a lot of "errors". Fortunately, you (and everybody else) can correct those. Which solution is best for you depends on your goals with the data, I think.

You could write software that tries to combine the "(2nd :" and "1962-1965)", but I don't know whether anyone would ever use that in a subject search.

Your software could discard subjects that are not so useful (to you), like these.

You could manually fix the subjects on openlibrary.org. In this case, clicking Edit on the book page would automatically create a work. You can add and edit subjects for works.

If you do not redistribute the data, you can enhance the subject information using any other source you like (e.g. the ISBN DB you mention). If you know of a subject database whose license allows reuse, even better.

If you find that certain sets of records contain many similar errors (like I found in the physical_description field some time ago), you could write a bot that automatically improves the live Open Library data.

Since subjects are only editable from the web when they are in works, you should look in the work record that the book belongs to for possibly updated subject terms. This book has no Work yet (at the time of writing), but you can add one by editing the edition.

Does this answer your question?

Regards,
Ben

On 2 May 2012 15:49, Sujoy Ghosh <[email protected]> wrote:
> Hi,
>
> We have downloaded the open library edition data dump from following link
> http://openlibrary.org/data/ol_dump_authors_latest.txt.gz
>
> While parsing the data, we found the subjects fields are corrupted for many
> editions. eg.
>
> For /books/OL7974826M (isbn=0809119935), the subjects filed is given
> following value
>
> "subjects": ["1962-1965)", "Congresses", "Declaratio de libertate religi",
> "(2nd :", "Vatican Council"]
>
> By using isbndb api I got below subjects data
>
> <Subjects>
> <Subject
> subject_id="vatican_council_2nd_1962_1965_declaratio_de_libertate_religi">
> Vatican Council -- (2nd :1962-1965). -- Declaratio de libertate
> religiosa -- Congresses
> </Subject>
> <Subject subject_id="freedom_of_religion_congresses">Freedom of
> religion -- Congresses</Subject>
> </Subjects>
>
> It's clearly visible that the open library data is corrupted in the subjects
> filed. This is observed in so many other editions also in the dump.
>
> Can you please help us to find out the correct data? Can you suggest any
> solution?
>
> Rgds,
> Sujoy
>
>
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
> [email protected]
