Hi,

maybe the following ideas are helpful for you:

> Monica wrote
> I have a system with the nutch configure.The all html pages that they are
> generated dinamically with Servlets a JSP, are correctly indexing with crawl,
> but I have a problem with the pdf and word files. My system save those files
> in database and in my portal I have urls that they show those files. But,
> those URLs are jsp, for example
> http://www.mydomain.com/myportal/file.jsp?id=xx and this URL returns an pdf
> file. The crawl doesn't reconize this contain. I test my system with URLs
> http://www.mydomain.com/myportal/file.pdf, and in this case the nutch indexes
> correctly.

If I recall correctly, Nutch uses the following procedure (in this order) to
determine a resource's mime-type:

1.) Try to use the extension to map the file to a mime-type
    (http://.../file.jsp is possibly mapped to html instead of pdf).
2.) Use the HTTP header's mime-type info.
3.) Use magic number guessing.

If one of these heuristics comes up with a mime-type, the remaining
heuristics are not tried.

An example for case 1.) is the following URL:
"http://en.wikipedia.org/wiki/EMM386.EXE". This file will be mapped to a
mime-type called "dos/x-application" or something similar. Nutch will produce
an error stating that it could not find a suitable plugin. Your browser,
however, will display this page correctly, because it relies on the HTTP
header's mime-type.

On Dec 13, 2007 3:36 PM, Mónica Lamas González <[EMAIL PROTECTED]> wrote:
> Sorry, when I test the URL http://www.mydomain.com/myportal/file.pdf
> <http://www.mydomain.com/myportal/file.pdf> , I don't obtain any result in my
> searches.

There is a limit on the size of a fetched resource. I found earlier (in my
case) that this limit is set too low for pdf files. Files exceeding the limit
are simply truncated. This works fine for html files, but the pdf parsing
plugin will fail if it is fed a truncated pdf file. (See nutch-default.xml
for a value named http.content.limit or similar and override it in
nutch-site.xml.) I don't know how the plugin for word files behaves in that
situation.

The things mentioned above are based on my experience with the Nutch default
configuration.

Hope it helps,
Martin
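
P.S. A minimal sketch of the override mentioned above, assuming the property
really is named http.content.limit, as it is in the stock nutch-default.xml.
The value is only an example, not a recommendation: as far as I know, the
description in nutch-default.xml says that a negative value disables
truncation, and a larger positive number of bytes should work as well. The
fragment goes inside the <configuration> element of conf/nutch-site.xml.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- Assumption: -1 means "do not truncate"; a byte count such as
       10485760 would cap downloads at 10 MB instead. -->
  <value>-1</value>
</property>
```

Pages fetched before the change stay truncated in the existing segments, so
the pdf URLs have to be fetched again before the parser sees the whole file.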
