Re: ANN: Sudoc bibliographic and authority data

Antoine Isaac Sun, 10 Jul 2011 05:36:07 -0700

Hi,

Which side effects are probable ?



Giovanni has made the same comment on data.europeana.eu a couple of
weeks ago. The data we serve there is different from the RDFa mark-up
on our web portal.
We had some reasons to do this, including, well, that the RDFa data is
mixing the info and non-info resources for making easier data
consumption (not mandatorily by search engines, btw), and working with
URIs that pre-date our linked data service.

The RDFa and the RDF obtained with LD-style conneg is also not about
the same URIs, which should avoid any confusion.



I can't see any "confusion" if you publish complementary data about the same 
resource URI (in our case) through complementary technologies.
I can imagine a burden for crawlers and other data consumers, but where is the 
confusion ?



Yeah, perhaps "confusion" is not the best word. Still, I think getting 
different flavors of data in the path of implementers could be awkward, if no 
documentation is given, or if it does not correspond to a well-identified practice. If we 
manage to get the Sindice guys slightly puzzled by our stuff, I can imagine what others 
may feel...

But I can understand that if Sindice tries to fetch both data sources,
it may assume the data to be the same. And this assumption could bring
a number of undesirable side effects if Sindice merges everything it
gets...


In our case, I don't see the *risk* of merging, but my point of view is maybe 
too narrow.


That being said, perhaps the solution lies in Sindice being less
greedy ;-) and just working with the first data source it finds for a
given URI.



Sindice is actually very temperate : it takes only our RDF/XML data :)



Excellent!

Cheers,

Antoine

I do like the idea of having several (simple) channels for data
publication over the web, which serve different goals.
Maybe we need to better articulate the practices and expectations,
though...

Cheers,

Antoine

Hi Giovanni,

On 09/07/2011 23:10, Giovanni Tummarello wrote:

Hi Nicolas,

It's getting into Sindice indeed -


Yes, I have noticed :)

quite politely, e.g. 1 every 5 secs -
we'll monitor speed and completeness. if you think it's ok for us to
crawl faster please say so via a robots.txt directive or just say so

May I suggest that you crawl twice as fast ?
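
For a sense of scale, here is a back-of-the-envelope estimate of what these crawl rates imply. It is only a sketch: it assumes each of the 10 million records mentioned in the announcement is fetched exactly once, sequentially, with no retries, and the function name is purely illustrative.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(record_count: int, seconds_per_request: float) -> float:
    """Days needed to fetch every record once at a fixed request interval."""
    return record_count * seconds_per_request / SECONDS_PER_DAY


RECORDS = 10_000_000  # figure from the announcement, assumed exact here

current = crawl_days(RECORDS, 5.0)  # one request every 5 seconds
doubled = crawl_days(RECORDS, 2.5)  # twice as fast

print(f"{current:.0f} days at 1 request / 5 s, {doubled:.0f} days at twice that rate")
```

Under these assumptions a full pass takes roughly 579 days at the current rate and roughly 289 days at double the rate.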


http://sindice.com/search?q=book&nq=&fq=domain%3Awww.sudoc.fr&sortbydate=1&interface=advanced

at the same time i notice something funny in the markup e.g. if you go
with a browser you get redirected to something that has almost no data

for example the sitemap contains

http://www.sudoc.fr/000000043

if you go there you get redirected to

http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043

which if you put in the inspector

http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.abes.fr%2FDB%3D2.1%2FSRCH%3FIKT%3D12%26TRM%3D000000043#TRIPLES

you get very little data

however of course if i use the inspector on
http://www.sudoc.fr/000000043 i get data

http://inspector.sindice.com/inspect?url=http%3A%2F%2Fwww.sudoc.fr%2F000000043&content=&contentType=auto#TRIPLES

which however is mostly schema.org data!

but in sindice i have lots of RDF data with all sort of other ontologies

http://sindice.com/search/page?url=http%3A%2F%2Fwww.sudoc.fr%2F000385123

is there any way you could try to normalize all into a single markup
type? i think it would be easier to debug and ultimately better for all..
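
The Inspector links quoted above simply carry the page URL, percent-encoded, in a `url` query parameter. A minimal sketch of how such a link can be built follows; the helper name is hypothetical, and only the URL pattern visible in the links above is assumed.

```python
from urllib.parse import quote

INSPECTOR = "http://inspector.sindice.com/inspect"


def inspector_link(page_url: str) -> str:
    """Build an Inspector link for page_url (hypothetical helper)."""
    # safe="" forces ":", "/", "?", "&" and "=" to be percent-encoded as well
    return f"{INSPECTOR}?url={quote(page_url, safe='')}"


link = inspector_link("http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043")
print(link)
```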


I will try to explain our intention, our constraints and the
mechanism we've implemented.

- Intention -

We want to meet several needs :
. providing RDF/XML to semantic-oriented clients like Sindice
. providing HTML + schema.org microdata to traditional search
engines like Google
. providing an HTML UI to users


- Constraints -

. For some reasons, we can't add microdata to our traditional Sudoc
UI. Hence the necessity of special HTML+microdata pages for search
engines. :(
. HTML+microdata pages and RDF pages can't support the same
vocabularies, because of schema.org's requirements.


- Mechanisms -

Let's start from : http://www.sudoc.fr/132133520

. If RDF/XML is called by the request, we provide RDF/XML content
(as if you had requested http://www.sudoc.fr/132133520.rdf)
It is what Sindice Crawler is doing and getting : the 55,764
documents that are found in your index are composed of triples
extracted from this RDF/XML page. It is what we expected. Fine :)

. If our Apache server considers a user agent to be a robot and if
this agent does not ask for RDF/XML, we provide special HTML content
(as if you had requested http://www.sudoc.fr/132133520.html)
It seems to work as Google cache contains this kind of HTML +
schema.org microdata pages :
http://webcache.googleusercontent.com/search?q=cache:www.sudoc.fr/000000043

. In other cases, we redirect to our traditional and non semantic UI
: http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM=000000043
. NB : we have planned to add these <link> elements to this HTML page :
<link rel="alternate" type="application/rdf+xml"
href="http://www.sudoc.fr/000000043.rdf"/> and <link rel="canonical"
href="http://www.sudoc.fr/000000043"/> to alleviate the URL
confusion.
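
The three cases above can be sketched as a small decision function. This is not the actual Apache configuration, which is not shown in this thread: the robot detection by substring and all function names are assumptions made for illustration, and only the URL patterns come from the message.

```python
# Hypothetical heuristic: the real server-side robot detection is not described.
ROBOT_HINTS = ("bot", "crawler", "spider")


def choose_representation(accept_header: str, user_agent: str) -> str:
    """Return 'rdf', 'microdata' or 'redirect' for a request on a record URL."""
    if "application/rdf+xml" in accept_header.lower():
        return "rdf"  # case 1: RDF/XML explicitly requested
    if any(hint in user_agent.lower() for hint in ROBOT_HINTS):
        return "microdata"  # case 2: a robot that did not ask for RDF/XML
    return "redirect"  # case 3: everyone else goes to the traditional UI


def target_url(record_id: str, accept_header: str, user_agent: str) -> str:
    """Map the chosen representation to the URL patterns given in the message."""
    kind = choose_representation(accept_header, user_agent)
    if kind == "rdf":
        return f"http://www.sudoc.fr/{record_id}.rdf"
    if kind == "microdata":
        return f"http://www.sudoc.fr/{record_id}.html"
    return f"http://www.sudoc.abes.fr/DB=2.1/SRCH?IKT=12&TRM={record_id}"


print(target_url("132133520", "application/rdf+xml", "some-rdf-client"))
print(target_url("132133520", "text/html", "Googlebot/2.1"))
print(target_url("132133520", "text/html", "Mozilla/5.0 Firefox/5.0"))
```

Checking the Accept header first mirrors the order of the cases above: a robot that asks for RDF/XML still gets RDF/XML.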


- - - - -

. It is not simple, but it seems to work, ie Google, Sindice and
users seem to get what they should.
. Is there a better way to obtain the same results ?
. Which side effects are probable ?

Thanks for your help and your attention !

Yann


looking forward to support
Giovanni
Gio


On Fri, Jul 8, 2011 at 1:27 PM, Kingsley Idehen wrote:

On 7/8/11 8:31 AM, Yann NICOLAS wrote:

On 08/07/2011 01:42, Kingsley Idehen wrote:

On 7/7/11 10:17 PM, Yann NICOLAS wrote:

Hello,

Sudoc [1], the French academic union catalogue maintained by ABES [2], has
just been released as linked open data.

10 million bibliographic records are now available as RDF/XML.

Examples for the Sudoc record whose internal id is 132133520 :
. Resource URI : http://www.sudoc.fr/132133520/id
. Generic document : http://www.sudoc.fr/132133520 (content negotiation is
supported)
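
As an illustration of the two URI patterns and of content negotiation from the client side, here is a minimal sketch. The helper names are hypothetical, and the request is only built, not sent, since nothing here is meant to depend on the service being reachable.

```python
from urllib.request import Request

BASE = "http://www.sudoc.fr"


def resource_uri(record_id: str) -> str:
    """URI of the described resource itself, following the /id pattern above."""
    return f"{BASE}/{record_id}/id"


def document_uri(record_id: str) -> str:
    """URI of the generic document, which supports content negotiation."""
    return f"{BASE}/{record_id}"


def rdf_request(record_id: str) -> Request:
    """Prepare (but do not send) a request asking for the RDF/XML variant."""
    return Request(document_uri(record_id), headers={"Accept": "application/rdf+xml"})


req = rdf_request("132133520")
print(resource_uri("132133520"))
print(req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual fetch.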


Great job!

Is there an RDF dump anywhere?


Sorry, we don't provide any dump, as the 10 000 000 files are generated on
the fly from Oracle (stored as XML type + some more tables).
We provide a complete sitemap at
http://www.sudoc.fr/noticesbiblio/sitemap.txt , and we hope that Sindice
will crawl the whole stuff.
Would it help ?
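
For consumers who want to use the sitemap in place of a dump, a minimal sketch follows. It assumes the sitemap is plain text with one record URL per line, and that appending ".rdf" to a record URL yields the RDF/XML variant; the function name is hypothetical, and the sample text stands in for the real file.

```python
def rdf_urls(sitemap_text: str) -> list[str]:
    """Turn sitemap lines into the matching .rdf URLs, skipping blank lines."""
    urls = []
    for line in sitemap_text.splitlines():
        url = line.strip()
        if url:
            urls.append(url.rstrip("/") + ".rdf")
    return urls


# Two-line sample standing in for the real sitemap
sample = "http://www.sudoc.fr/000000043\n\nhttp://www.sudoc.fr/132133520\n"
for url in rdf_urls(sample):
    print(url)
```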

Any advice welcome,

Yann

--
Yann NICOLAS
Etudes & Projets
ABES

Okay, no problem with sitemaps as dump alternatives re. getting data
imported into Linked Data hubs such as our LOD cloud cache and
Sindice etc..


--

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen



--
Yann NICOLAS
Etudes & Projets
ABES
