As for the Google Custom Search solution, I'd add that it sometimes
yields weird results: for instance, we indexed a site and, for a given
search term, Google says "about 16 results" (with 10 hits displayed
on the page), and when we click through to page 2, it says "about 12
results" (showing the two remaining hits). OK, it says "about", but it's
still a bit strange that the system can't compute the proper number
of hits up front. (This occurs when using label refinements.)
On the other hand, it's super easy to set up...
On 20/02/2013 20:33, Nathan Tallman wrote:
@Jason and @Michele: I'd rather stay away from a Google solution. The
reason being that they don't index everything. Our sitemap is submitted
nightly and out of about 6000 URLs only 1500 are indexed. I can't make sure
Google indexes the PDFs or be sure that they always will. (If I'm
misunderstanding this, please let me know.)
@Péter: The VuFind solution I mentioned is very similar to what you use
here. It uses Aperture (although soon to use Tika instead) to grab the
full text and shoves everything inside a Solr index. The import is managed
through a PHP script that crawls every URL on the sitemap. The only part I
don't have is removing deleted, adding new, and updating changed
webpages/files. I'm not sure how to rework the script to use a list of new
files rather than the sitemap, but everything is on the same server so that
should work.
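
One way to picture the missing piece (removing deleted, adding new, and updating changed pages) is to diff tonight's sitemap against the one from the previous run. The following is only a minimal Python sketch, not the VuFind import script: the function names `parse_sitemap` and `plan_update` are made up for illustration, and it assumes the sitemap follows the standard sitemaps.org format and ideally carries `<lastmod>` values. Entries without `<lastmod>` are conservatively treated as changed.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return {url: lastmod or None} for every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries[loc] = lastmod.strip() if lastmod else None
    return entries


def plan_update(previous, current):
    """Compare two {url: lastmod} maps and return (added, changed, deleted).

    A URL with no lastmod in the current sitemap is treated as changed,
    because there is no way to tell from the sitemap alone.
    """
    prev_urls, curr_urls = set(previous), set(current)
    added = sorted(curr_urls - prev_urls)
    deleted = sorted(prev_urls - curr_urls)
    changed = sorted(
        u for u in curr_urls & prev_urls
        if current[u] is None or current[u] != previous[u]
    )
    return added, changed, deleted
```

With the three lists in hand, the import would only need to fetch and index `added` and `changed`, and delete the documents for `deleted` from the index. How those index operations are issued depends on the import script and is not shown here.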
On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman <[email protected]> wrote:
My institution is looking for ways to provide search across PDFs through
our website. Specifically, PDFs linked from finding aids. Ideally searching
within a collection's PDFs or possibly across all PDFs linked from all
finding aids.
We do not have a CMS or a digital repository. A digital repository is on
the horizon, but it's a ways out and we need to offer the search sooner.
I've looked into Swish-e but haven't had much luck getting anything off the
ground.
One way we know we can do this is through our discovery layer, VuFind, using
its ability to full-text index a website based on a sitemap (which would
include PDFs linked from finding aids). Facets could be created for
collections, and we may be able to create a search box on the finding aid
nav that searches specifically that collection.
But I'm not sure how scalable that solution is. The indexing agent cannot
discern when a page was updated, so it has to re-scrape everything, every
night. The impetus collection is going to have over 1000 PDFs. And that's
to start. Creating the index will start to take a
long, long time.
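
When no reliable modification date is available, one common workaround is to fingerprint each file and only re-extract text from files whose fingerprint changed since the last run. Below is a rough Python sketch under stated assumptions: the PDFs are readable as local files by the indexing job, and `file_digest` and `files_to_reindex` are hypothetical helper names, not part of VuFind, Aperture, or Tika. Every file is still read once per night, but reading and hashing a file is typically far cheaper than running full-text extraction on it.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def files_to_reindex(paths, state):
    """Return the paths whose content differs from the recorded digest.

    `state` maps path -> digest from the previous run and is updated in
    place, so it can be saved (e.g. as JSON) and reloaded the next night.
    """
    stale = []
    for path in paths:
        key = str(path)
        digest = file_digest(key)
        if state.get(key) != digest:
            stale.append(key)
            state[key] = digest
    return stale
```

On the first run every file is reported as stale; on later runs only new or modified files are, so the nightly job would shrink to the files that actually changed.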
Does anyone have any ideas or know of any useful tools for this project?
It doesn't have to be perfect; quick and dirty may work. (The OCR's dirty
anyway :-)
Thanks,
Nathan
--
*Julien Gibert*
Agence Bibliographique de l'Enseignement Supérieur
227, avenue Professeur Jean Louis Viala
34193 Montpellier cedex 5
Tél : 33 (0)4 67 54 84 07
Fax : 33 (0)4 67 54 84 14