Have you had a look at the feature added, and does it work for you? I'd also still be interested in knowing where you are seeing out-of-memory situations.
Karl

On Thu, Jun 23, 2011 at 8:03 AM, Karl Wright <[email protected]> wrote:
> Hi Erlend,
>
> I hope you are not seeing memory issues on large files with ManifoldCF
> itself. That should not happen, and if it does we need to figure out
> why.
>
> Solr memory issues, on the other hand, I can believe. If that is the
> problem, then I agree we should try to do something about it.
> Probably the right thing to do is (since it is a Solr limitation)
> adding a configuration parameter to the Solr connector that specifies
> the maximum size of a file the connection will accept. Files larger
> than that should return a 400 if indexing is attempted, etc.
>
> Perhaps we should also consider adding a new method to the
> IOutputConnector interface that returns a maximum file size value, and
> expose that in IVersionActivity and IProcessActivity. That would
> allow connectors to make output-based decisions as to whether they
> should fetch large files in the first place.
>
> Karl
>
>
> On Thu, Jun 23, 2011 at 7:32 AM, Erlend Garåsen <[email protected]>
> wrote:
>>
>> I will create a ticket today. Post filtering sounds like a good idea.
>>
>> Another thing. We are facing memory problems with huge documents. Maybe we
>> should add another feature in order to cope with such documents, for instance
>> skip documents which exceed a preset size. We have discovered PDFs of 500
>> MB. What do you think? Do we need such a feature as well?
>>
>> Erlend
>>
>> On 23.06.11 12.08, Karl Wright wrote:
>>>
>>> Have there been any further developments on this thread?
>>> Karl
>>>
>>> On Tue, Jun 21, 2011 at 6:08 AM, Karl Wright<[email protected]> wrote:
>>>>
>>>> Sure. But you've already convinced me we need a new feature. ;-)
>>>>
>>>> Karl
>>>>
>>>> On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen<[email protected]>
>>>> wrote:
>>>>>
>>>>> Sure, I can create a ticket. But first I want to discuss this issue with
>>>>> the two search consultants we have hired.
>>>>>
>>>>> I decided to post to the dev list in order to get some feedback on this
>>>>> issue.
>>>>>
>>>>> Erlend
>>>>>
>>>>> On 20.06.11 18.00, Karl Wright wrote:
>>>>>>
>>>>>> Hi Erlend,
>>>>>>
>>>>>> The inclusions and exclusions are based solely on URL, and block the
>>>>>> connector from fetching the file. Otherwise you would easily wind up
>>>>>> fetching the entire web.
>>>>>>
>>>>>> However, this raises an interesting issue as to whether there's a way
>>>>>> in the web connector to do what you are trying to do, which is to
>>>>>> filter based on URL after links have been extracted. The current
>>>>>> inclusions/exclusions work fine for any URLs without links but do not
>>>>>> allow for the case you are looking for.
>>>>>>
>>>>>> Can you create a ticket? The suggestion would be to introduce
>>>>>> post-extraction inclusions and exclusions into the connector.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> I just realized that if I exclude HTML files for a job, links in these
>>>>>>> files will not be followed. Is this a desirable behaviour? Should links
>>>>>>> be followed regardless of the exclude filter?
>>>>>>>
>>>>>>> I discovered this issue when I was going to crawl only PDFs and
>>>>>>> realized that the job ended without finding any documents at all. I
>>>>>>> think I had something like this in my include list:
>>>>>>> http://foreninger.uio.no/.*\.pdf$
>>>>>>> http://folk.uio.no/.*\.pdf$
>>>>>>>
>>>>>>> Erlend
>>>>>>>
>>>>>>> --
>>>>>>> Erlend Garåsen
>>>>>>> Center for Information Technology Services
>>>>>>> University of Oslo
>>>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Erlend Garåsen
>>>>> Center for Information Technology Services
>>>>> University of Oslo
>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>>
>>>>
>>
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
>
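
For anyone reading this thread later, here is a rough sketch of the two ideas discussed above: separating the "which URLs may be fetched" filter from a post-extraction "which documents get indexed" filter, and skipping documents over a preset size. It is a toy model in Python, not ManifoldCF code. The names `crawl`, `fake_fetch`, `SITE`, `fetch_patterns`, `index_patterns` and `max_bytes` are all made up for illustration, and the page sizes are invented.

```python
import re
from collections import deque

# Hypothetical in-memory "web": url -> (size in bytes, outgoing links).
SITE = {
    "http://folk.uio.no/": (1_000, ["http://folk.uio.no/a.html"]),
    "http://folk.uio.no/a.html": (2_000, ["http://folk.uio.no/small.pdf",
                                          "http://folk.uio.no/big.pdf"]),
    "http://folk.uio.no/small.pdf": (50_000, []),
    "http://folk.uio.no/big.pdf": (500_000_000, []),
}


def fake_fetch(url):
    """Stand-in for an HTTP fetch; returns (size, links) from SITE."""
    return SITE[url]


def crawl(seeds, fetch, fetch_patterns, index_patterns, max_bytes=None):
    """Breadth-first crawl with two separate URL filters.

    fetch_patterns decide which discovered links are followed at all.
    index_patterns are applied afterwards and decide which fetched
    documents are handed to the index. Seeds are always fetched.
    Documents larger than max_bytes (if given) are skipped.
    """
    fetch_res = [re.compile(p) for p in fetch_patterns]
    index_res = [re.compile(p) for p in index_patterns]
    seen = set(seeds)
    queue = deque(seeds)
    indexed = []
    while queue:
        url = queue.popleft()
        size, links = fetch(url)
        # Post-extraction filter: only matching documents are indexed.
        if any(r.match(url) for r in index_res):
            if max_bytes is None or size <= max_bytes:
                indexed.append(url)
        # Fetch filter: only matching links are queued for fetching.
        for link in links:
            if link not in seen and any(r.match(link) for r in fetch_res):
                seen.add(link)
                queue.append(link)
    return indexed


PDF_ONLY = r"http://folk.uio.no/.*\.pdf$"
WHOLE_SITE = r"http://folk.uio.no/.*"
SEEDS = ["http://folk.uio.no/"]

# One filter for everything: a.html is never fetched, so no PDFs are found.
print(crawl(SEEDS, fake_fetch, [PDF_ONLY], [PDF_ONLY]))
# Two filters: HTML is fetched for its links, only PDFs are indexed.
print(crawl(SEEDS, fake_fetch, [WHOLE_SITE], [PDF_ONLY]))
# Same, but documents over 100 MB are skipped.
print(crawl(SEEDS, fake_fetch, [WHOLE_SITE], [PDF_ONLY], max_bytes=100_000_000))
```

With a single PDF-only filter the crawl comes back empty, which matches the job that ended without finding any documents. Note that this sketch only learns the size after the fetch; the IOutputConnector suggestion above is about letting a connector decide before fetching, which the sketch does not attempt to model.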
