Hi Nathan, you can use Apache Solr. It doesn't need to reindex the whole file set every time; you can add or update individual items when needed. It uses Apache Tika, a text extraction tool, which can extract the text from PDFs. You can create a custom bash script which runs every day and sends files to Solr based on file dates. Let me know if you need more help with this.
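Here is a rough sketch of such a daily job, written in Python rather than bash only to keep it readable. It assumes Solr's extracting request handler (the Tika-based one) is enabled at /update/extract on your core. The core name, the PDF directory and the stamp file path are made up for illustration, and so are the helper function names. It finds PDFs modified since the previous run and posts only those, using the file path as the document id.

```python
import os
import time
import urllib.parse
import urllib.request

# Assumed values for illustration only; adjust to your own setup.
SOLR_CORE_URL = "http://localhost:8983/solr/findingaids"  # hypothetical core
PDF_ROOT = "/var/www/findingaids/pdfs"                     # hypothetical path
STAMP_FILE = "/var/tmp/last_solr_index.stamp"              # hypothetical path


def find_modified_pdfs(root, since):
    """Return sorted paths of PDFs under root modified after `since` (epoch seconds)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.lower().endswith(".pdf") and os.path.getmtime(path) > since:
                found.append(path)
    return sorted(found)


def build_extract_url(core_url, doc_id):
    """Build the extracting-handler URL; literal.id sets the document's id field."""
    query = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return core_url.rstrip("/") + "/update/extract?" + query


def post_pdf(core_url, path):
    """Send one PDF as the raw request body; text extraction happens inside Solr."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            build_extract_url(core_url, path),
            data=fh.read(),
            headers={"Content-Type": "application/pdf"},
        )
    with urllib.request.urlopen(request) as response:
        return response.status


def read_last_run(stamp_file):
    """Return the recorded time of the previous run, or 0.0 on the first run."""
    return os.path.getmtime(stamp_file) if os.path.exists(stamp_file) else 0.0


def mark_run(stamp_file, started):
    """Store the start time of this run as the stamp file's modification time."""
    with open(stamp_file, "a"):
        pass
    os.utime(stamp_file, (started, started))


# Only do real work when the (hypothetical) PDF directory actually exists.
if __name__ == "__main__" and os.path.isdir(PDF_ROOT):
    started = time.time()
    for pdf in find_modified_pdfs(PDF_ROOT, read_last_run(STAMP_FILE)):
        post_pdf(SOLR_CORE_URL, pdf)
    mark_run(STAMP_FILE, started)
```

Run from cron once a day, this only touches new or changed files, so the nightly job stays short even as the collection grows. Committing after every file is simple but slow; with many files it would probably be better to commit once at the end.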
Regards,
Péter

2013/2/20 Michele R Combs <mrrot...@syr.edu>:
> What about just a Google site search?
>
> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Nathan Tallman
> Sent: Wednesday, February 20, 2013 12:54 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Providing Search Across PDFs
>
> My institution is looking for ways to provide search across PDFs through our
> website. Specifically, PDFs linked from finding aids. Ideally searching
> within a collection's PDFs or possibly across all PDFs linked from all
> finding aids.
>
> We do not have a CMS or a digital repository. A digital repository is on the
> horizon, but it's a ways out and we need to offer the search sooner.
> I've looked into Swish-e but haven't had much luck getting anything off the
> ground.
>
> One way we know we can do this is through our discovery layer, VuFind, using its
> ability to full-text index a website based on a sitemap (which would include
> PDFs linked from finding aids). Facets could be created for collections, and
> we may be able to create a search box on the finding aid nav that searches
> specifically that collection.
>
> But, I'm not sure how scalable that solution is. The indexing agent cannot
> discern when a page was updated, so it has to re-scrape everything,
> every night. The impetus collection is going to have over
> 1000 PDFs. And that's to start. Creating the index will start to take a long,
> long time.
>
> Does anyone have any ideas or know of any useful tools for this project?
> Doesn't have to be perfect, quick and dirty may work. (The OCR's dirty anyway
> :-)
>
> Thanks,
> Nathan

--
Péter Király
software developer

Europeana - http://europeana.eu
eXtensible Catalog - http://eXtensibleCatalog.org