Hi Nathan,

You can use Apache Solr. It doesn't need to reindex the whole file set
every time; you can add or update individual items when needed.
It uses Apache Tika, a text extraction tool, which can extract the
text from PDFs.
You can create a custom bash script which runs every day and sends
files to Solr based on their modification dates.
Let me know if you need more help with this.
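
A rough sketch of what such a daily job could look like, written in
Python rather than bash only because it is easier to read in an email.
It assumes Solr's extracting request handler (the one that passes files
to Tika) is enabled at /update/extract on localhost. The URL, the PDF
directory and all function names are my assumptions, so adjust them to
your setup.

```python
import os
import time
import urllib.parse
import urllib.request

# Assumed values: change these to match your Solr install and PDF folder.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"
PDF_ROOT = "/var/www/findingaids/pdf"


def changed_pdfs(root, since):
    """Yield paths of PDFs under root modified after `since` (epoch seconds)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name.lower().endswith(".pdf") and os.path.getmtime(path) > since:
                yield path


def extract_url(path, base=SOLR_EXTRACT_URL):
    """Build the request URL; the file path doubles as the document id."""
    query = urllib.parse.urlencode({"literal.id": path, "commit": "true"})
    return base + "?" + query


def send_to_solr(path):
    """POST one PDF to Solr, which hands it to Tika for text extraction."""
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            extract_url(path),
            data=handle.read(),
            headers={"Content-Type": "application/pdf"},
        )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_recent(root=PDF_ROOT, max_age_hours=24):
    """Send every PDF changed in the last max_age_hours to Solr."""
    cutoff = time.time() - max_age_hours * 3600
    for path in changed_pdfs(root, cutoff):
        send_to_solr(path)
```

Calling index_recent() once a day from cron would resend only the PDFs
touched since the previous run. Because the id stays the same for a
given path, a resent file replaces its old entry instead of creating a
duplicate. Committing on every file is simple but slow; with many files
you would probably commit once at the end.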

Regards,
Péter


2013/2/20 Michele R Combs <mrrot...@syr.edu>:
> What about just a Google site search?
>
> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
> Nathan Tallman
> Sent: Wednesday, February 20, 2013 12:54 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Providing Search Across PDFs
>
> My institution is looking for ways to provide search across PDFs through our 
> website. Specifically, PDFs linked from finding aids. Ideally searching 
> within a collection's PDFs or possibly across all PDFs linked from all 
> finding aids.
>
> We do not have a CMS or a digital repository. A digital repository is on the 
> horizon, but it's a ways out and we need to offer the search sooner.
> I've looked into Swish-e but haven't had much luck getting anything off the 
> ground.
>
> One way we know we can do this is through our discovery layer VuFind, using its 
> ability to full-text index a website based on a sitemap (which would include 
> PDFs linked from finding aids). Facets could be created for collections, and 
> we may be able to create a search box on the finding aid nav that searches 
> specifically that collection.
>
> But, I'm not sure how scalable that solution is. The indexing agent cannot 
> discern when a page was updated, so it has to re-scrape everything 
> every night. The impetus collection is going to have over
> 1000 PDFs. And that's to start. Creating the index will start to take a long, 
> long time.
>
> Does anyone have any ideas or know of any useful tools for this project?
> Doesn't have to be perfect, quick and dirty may work. (The OCR's dirty anyway 
> :-)
>
> Thanks,
> Nathan



-- 
Péter Király
software developer

Europeana - http://europeana.eu
eXtensible Catalog - http://eXtensibleCatalog.org
