Hi Sujoy,

The subjects do not show up on the website
<http://openlibrary.org/books/OL7974826M/Religious_Freedom_1965_and_1975>,
but they are visible in the JSON view
<http://openlibrary.org/books/OL7974826M.json>, so it's not just
the data dump that is "corrupted". My first thought was that the original
subject was split incorrectly.
Looking further into the source of the data, it seems that the
original data is from Amazon.com:
<http://www.amazon.com/gp/product/0809119935>. The subjects mentioned
on the Amazon page are the same strange subjects, suggesting it's
Amazon's "fault".

The terms in the second subject element from the ISBN DB(?) are on
Amazon too, but not in the OL record. That is strange; they were
probably lost during import. Perhaps the import process has improved
since this record was imported - I don't know.

Unfortunately, Open Library contains a lot of "errors". Fortunately,
you (and everybody else) can correct those.

What the best solution is for you depends on your goals with the data, I think.
You could write software that tries to combine the "(2nd :" and
"1962-1965)", but I don't know whether anyone would ever use that in a
subject search. Your software could discard subjects that are not so
useful (to you), like these.
You could manually fix the subjects on openlibrary.org. In this case,
clicking Edit on the book page would automatically create a work. You
can add and edit subjects for works.
If you do not redistribute the data, you can enhance the subject
information using any other source you like (e.g. the ISBN DB you
mention). If you know of a subject database whose license allows
reuse, even better.
If you find certain sets of records contain many similar errors (like
I found in the physical_description field some time ago), you could
write a bot that automatically improves the live Open Library data.
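
To make the first two options a bit more concrete, here is a rough
Python sketch. It is only a heuristic I made up for this example, not
something Open Library provides: it counts parentheses to spot
fragments like "(2nd :" and "1962-1965)", glues them together when
there is exactly one of each, and otherwise sets them aside so you can
discard them. The function names are hypothetical.

```python
def paren_balance(text):
    """Return the number of '(' minus the number of ')' in text."""
    return text.count("(") - text.count(")")


def clean_subjects(subjects):
    """Split subjects into (usable, leftover_fragments).

    Heuristic assumption: a subject with unmatched parentheses is a
    fragment of a longer heading that was split incorrectly.
    """
    whole = [s for s in subjects if paren_balance(s) == 0]
    openers = [s for s in subjects if paren_balance(s) > 0]
    closers = [s for s in subjects if paren_balance(s) < 0]
    # Only recombine when the pairing is unambiguous.
    if len(openers) == 1 and len(closers) == 1:
        whole.append(openers[0] + closers[0])
        return whole, []
    return whole, openers + closers


# The subjects from the record discussed in this thread.
subjects = ["1962-1965)", "Congresses", "Declaratio de libertate religi",
            "(2nd :", "Vatican Council"]
usable, fragments = clean_subjects(subjects)
```

If you would rather drop the fragments than recombine them, you can
simply ignore any subject for which paren_balance() is not zero. Note
that this does nothing about truncated terms such as "Declaratio de
libertate religi".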

Since subjects are only editable from the web when they are in works,
you should look in the work record that the book belongs to for
possibly updated subject terms. This book has no Work yet (at the time
of writing), but you can add one by editing the edition.
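
If you process many editions, a small helper like the one below could
tell you which work records to look at. It is a sketch under the
assumption that an edition's JSON lists its works under a "works" key
as a list of {"key": "/works/..."} objects, and that appending ".json"
to a key gives the JSON view, as with the edition URL above. The
function name and the example work key are made up.

```python
def work_json_urls(edition, base_url="http://openlibrary.org"):
    """Return JSON-view URLs for the works an edition record points to.

    Returns an empty list when the edition has no work yet.
    """
    works = edition.get("works", [])
    return [base_url + w["key"] + ".json" for w in works if "key" in w]


# An edition without a work, like this book at the time of writing.
no_work = {"key": "/books/OL7974826M"}
# A hypothetical edition that does belong to a work (the key is invented).
with_work = {"key": "/books/OL1M", "works": [{"key": "/works/OL1W"}]}

urls_without = work_json_urls(no_work)
urls_with = work_json_urls(with_work)
```

An empty result means you have to fall back on the edition's own
subjects, or create the work first by editing the edition.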

Does this answer your question?

Regards,

Ben

On 2 May 2012 15:49, Sujoy Ghosh <[email protected]> wrote:
> Hi,
>
> We have downloaded the open library edition data dump from following link
> http://openlibrary.org/data/ol_dump_authors_latest.txt.gz
>
> While parsing the data, we found the subjects fields are corrupted for many
> editions. eg.
>
> For /books/OL7974826M (isbn=0809119935), the subjects field is given the
> following value
>
> "subjects": ["1962-1965)", "Congresses", "Declaratio de libertate religi",
> "(2nd :", "Vatican Council"]
>
> By using isbndb api I got below subjects data
>
> <Subjects>
>      <Subject
> subject_id="vatican_council_2nd_1962_1965_declaratio_de_libertate_religi">
>       Vatican Council -- (2nd :1962-1965). -- Declaratio de libertate
> religiosa -- Congresses
>       </Subject>
>       <Subject subject_id="freedom_of_religion_congresses">Freedom of
> religion -- Congresses</Subject>
> </Subjects>
>
> It's clearly visible that the open library data is corrupted in the subjects
> field. This is observed in many other editions in the dump as well.
>
> Can you please help us to find out the correct data? Can you suggest any
> solution?
>
> Rgds,
> Sujoy
>
>
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
> [email protected]
>