_______________________________________________ Dspace-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/dspace-devel

From what I can recall of my phone conversation with Anurag at Google Scholar, he said something very similar. Without any "citation_pdf_url", Google Scholar is left with no choice but to "guess" which file seems most important (and in many cases it may guess wrong -- which may actually be part of the reason we've seen DS-1387, though it's unconfirmed whether the two issues are related). In talking with Anurag, I made it clear that we don't always have a PDF -- he said that doesn't matter as much anymore, and that any file format is fine in "citation_pdf_url" as long as it gives Google Scholar some sort of "hint" as to which file is likely the most important to index.
As Richard notes, this "hint" may not be perfect for all DSpace use cases. But I still think it's better to have *something* in "citation_pdf_url" than nothing. Even if we end up with an "index.html" file linked in "citation_pdf_url", it at least potentially gives Google Scholar a better starting point for indexing all the files, rather than having it guess based on the entire file listing. I also suspect that this "index.html" case covers only a small percentage of content in DSpace instances worldwide (though I only have anecdotal evidence). Obviously it'd be great if we could eventually make this configurable to cover 100% of content use cases...but if we can get to 95% coverage with the few simple tweaks that Andrea suggests, that's a huge step forward in itself.
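To make the "something rather than nothing" idea concrete, here is a rough sketch of how a hint file might be chosen and emitted as a "citation_pdf_url" meta tag. This is not DSpace code: the ItemFile type, the function names, and the selection order (a file flagged as primary, then the first PDF, then simply the first file) are all assumptions made for illustration, and the order is not necessarily what Andrea proposed.

```python
from html import escape
from typing import NamedTuple, Optional, Sequence


class ItemFile(NamedTuple):
    """Hypothetical stand-in for one file attached to a repository item."""
    url: str
    mime_type: str
    is_primary: bool = False


def pick_hint_file(files: Sequence[ItemFile]) -> Optional[ItemFile]:
    """Pick the file to advertise to the crawler, or None if there are no files.

    Assumed order of preference (illustrative only):
    1. a file explicitly flagged as primary,
    2. the first PDF,
    3. the first file of any format, since any hint beats no hint.
    """
    if not files:
        return None
    for f in files:
        if f.is_primary:
            return f
    for f in files:
        if f.mime_type == "application/pdf":
            return f
    return files[0]


def citation_pdf_meta(files: Sequence[ItemFile]) -> Optional[str]:
    """Build the meta tag for the chosen file, or None when the item has no files."""
    chosen = pick_hint_file(files)
    if chosen is None:
        return None
    # Escape the URL so characters such as & and " are safe inside the attribute.
    return '<meta name="citation_pdf_url" content="%s" />' % escape(chosen.url, quote=True)
```

With this order, an item that holds only an "index.html" still gets a tag pointing at that file, which is the fallback case discussed above.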
If need be, I can bring questions back to Anurag, or we can let him know that there may be cases where "citation_pdf_url" just points at an index page, and that the Scholar crawler may want to crawl the pages linked from it. Anurag seems willing to work with us on improving this process...he just needs DSpace to give the Scholar crawler some better "hints" to work from. Without those hints, we are really leaving the Scholar crawler "in the dark" to root around for whatever it can find.