Have you had a look at the feature added, and does it work for you?
I'd also still be interested in knowing where you are seeing
out-of-memory situations.

Karl

On Thu, Jun 23, 2011 at 8:03 AM, Karl Wright <[email protected]> wrote:
> Hi Erlend,
>
> I hope you are not seeing memory issues on large files with ManifoldCF
> itself.  That should not happen, and if it does we need to figure out
> why.
>
> Solr memory issues, on the other hand, I can believe.  If that is the
> problem, then I agree we should try to do something about it.
> Probably the right thing to do (since it is a Solr limitation) is to add
> a configuration parameter to the Solr connector that specifies the
> maximum size of a file the connection will accept.  Attempts to index
> files larger than that should be rejected with a 400, etc.
>
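
[Editor's sketch of the size-limit idea in the paragraph above. It is written in Python for brevity and is not ManifoldCF or Solr code: `SolrConnectionConfig`, `max_file_size_bytes`, `index_document` and the two status constants are hypothetical names chosen for illustration. The only detail taken from the thread is that oversized files should be rejected with a 400.]

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical status codes; the thread only says oversized files get "a 400".
ACCEPTED = 200
REJECTED = 400


@dataclass
class SolrConnectionConfig:
    """Hypothetical connection settings holding the proposed parameter."""
    max_file_size_bytes: Optional[int] = None  # None means "no limit configured"


def index_document(config: SolrConnectionConfig, uri: str, size_bytes: int) -> int:
    """Reject a document by size alone, before any content is read."""
    limit = config.max_file_size_bytes
    if limit is not None and size_bytes > limit:
        return REJECTED
    # A real connector would stream the document to Solr at this point.
    return ACCEPTED


config = SolrConnectionConfig(max_file_size_bytes=100 * 1024 * 1024)
status = index_document(config, "http://example.org/huge.pdf", 500 * 1024 * 1024)
```

[The check uses only the reported size, so a 500 MB file is turned away without being buffered in memory.]
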
> Perhaps we should also consider adding a new method to the
> IOutputConnector interface that returns a maximum file size value, and
> expose that in IVersionActivity and IProcessActivity.  That would
> allow connectors to make output-based decisions as to whether they
> should fetch large files in the first place.
>
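
[Editor's sketch of the second idea above: letting the fetching side ask the output side for its maximum file size before downloading anything. Again this is Python for illustration, not the Java interfaces named in the message. `get_max_document_length`, `SizeLimitedOutput` and `should_fetch` are hypothetical names; in the proposal the value would reach connectors through IVersionActivity and IProcessActivity.]

```python
from typing import Optional, Protocol


class OutputConnector(Protocol):
    """Hypothetical stand-in for the proposed new output connector method."""

    def get_max_document_length(self) -> Optional[int]:
        """Return the largest accepted size in bytes, or None for no limit."""
        ...


class SizeLimitedOutput:
    """Toy output connector that reports a fixed limit."""

    def __init__(self, max_bytes: Optional[int]) -> None:
        self._max_bytes = max_bytes

    def get_max_document_length(self) -> Optional[int]:
        return self._max_bytes


def should_fetch(output: OutputConnector, reported_size: Optional[int]) -> bool:
    """Decide whether a repository connector should download a document.

    If either the limit or the document size is unknown, fetch anyway and
    leave the final decision to the output connector.
    """
    limit = output.get_max_document_length()
    if limit is None or reported_size is None:
        return True
    return reported_size <= limit


output = SizeLimitedOutput(max_bytes=50 * 1024 * 1024)
fetch_big_pdf = should_fetch(output, 500 * 1024 * 1024)
```

[The point of the design is that the download itself is skipped, rather than fetching a large file only to have the output connector reject it.]
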
> Karl
>
>
> On Thu, Jun 23, 2011 at 7:32 AM, Erlend Garåsen <[email protected]> wrote:
>>
>> I will create a ticket today. Post filtering sounds like a good idea.
>>
>> Another thing: we are facing memory problems with huge documents. Maybe we
>> should add another feature to cope with such documents, for instance
>> skipping documents that exceed a preset size. We have discovered PDFs of
>> 500 MB. What do you think? Do we need such a feature as well?
>>
>> Erlend
>>
>> On 23.06.11 12.08, Karl Wright wrote:
>>>
>>> Have there been any further developments on this thread?
>>> Karl
>>>
>>> On Tue, Jun 21, 2011 at 6:08 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>> Sure.  But you've already convinced me we need a new feature. ;-)
>>>>
>>>> Karl
>>>>
>>>> On Tue, Jun 21, 2011 at 3:50 AM, Erlend Garåsen <[email protected]> wrote:
>>>>>
>>>>> Sure, I can create a ticket. But first I want to discuss this issue with
>>>>> the two search consultants we have hired.
>>>>>
>>>>> I decided to post to the dev list in order to get some feedback on this
>>>>> issue.
>>>>>
>>>>> Erlend
>>>>>
>>>>> On 20.06.11 18.00, Karl Wright wrote:
>>>>>>
>>>>>> Hi Erlend,
>>>>>>
>>>>>> The inclusions and exclusions are based solely on URL, and block the
>>>>>> connector from fetching the file.  Otherwise you would easily wind up
>>>>>> fetching the entire web.
>>>>>>
>>>>>> However, this raises an interesting issue as to whether there's a way
>>>>>> in the web connector to do what you are trying to do, which is to
>>>>>> filter based on URL after links have been extracted.  The current
>>>>>> inclusions/exclusions work fine for any URLs without links but do not
>>>>>> allow for the case you are looking for.
>>>>>>
>>>>>> Can you create a ticket?  The suggestion would be to introduce
>>>>>> post-extraction inclusions and exclusions into the connector.
>>>>>>
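
[Editor's sketch of the post-extraction filtering suggested above, in Python. It is a toy model, not the web connector's implementation: `PAGES`, `crawl`, `fetch_patterns` and `index_patterns` are hypothetical, the link graph is invented, and the sketch assumes seed URLs are subject to the fetch filter too. The PDF pattern is the one quoted further down in this thread.]

```python
import re
from collections import deque

# Invented link graph standing in for real fetching and link extraction.
PAGES = {
    "http://folk.uio.no/index.html": [
        "http://folk.uio.no/a/paper.pdf",
        "http://folk.uio.no/a/more.html",
    ],
    "http://folk.uio.no/a/more.html": ["http://folk.uio.no/a/thesis.pdf"],
}


def crawl(seeds, fetch_patterns, index_patterns):
    """Breadth-first crawl with separate fetch and post-extraction filters."""
    fetch_res = [re.compile(p) for p in fetch_patterns]
    index_res = [re.compile(p) for p in index_patterns]
    seen, indexed = set(seeds), []
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        # Fetch filter: a URL that fails here is never downloaded,
        # so its links are never discovered.
        if not any(r.match(url) for r in fetch_res):
            continue
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Post-extraction filter: applied only after links were extracted.
        if any(r.match(url) for r in index_res):
            indexed.append(url)
    return indexed


seeds = ["http://folk.uio.no/index.html"]
pdf_only = [r"http://folk.uio.no/.*\.pdf$"]
whole_site = [r"http://folk.uio.no/.*"]

current_behaviour = crawl(seeds, pdf_only, pdf_only)
proposed_behaviour = crawl(seeds, whole_site, pdf_only)
```

[With the PDF pattern as the only filter, the HTML seed is blocked and nothing is found, which matches the behaviour reported in the original question. With a broad fetch filter and the PDF pattern applied after extraction, the HTML pages are traversed but only the PDFs are kept.]
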
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 20, 2011 at 10:53 AM, Erlend Garåsen <[email protected]> wrote:
>>>>>>>
>>>>>>> I just realized that if I exclude HTML files for a job, links in these
>>>>>>> files will not be followed. Is this desirable behaviour? Should links
>>>>>>> be followed regardless of the exclude filter?
>>>>>>>
>>>>>>> I discovered this issue when I tried to crawl only PDFs and realized
>>>>>>> that the job ended without finding any documents at all. I think I had
>>>>>>> something like this in my include list:
>>>>>>> http://foreninger.uio.no/.*\.pdf$
>>>>>>> http://folk.uio.no/.*\.pdf$
>>>>>>>
>>>>>>> Erlend
>>>>>>>
>>>>>>> --
>>>>>>> Erlend Garåsen
>>>>>>> Center for Information Technology Services
>>>>>>> University of Oslo
>>>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>>>>>
>>>>>
>>>>
>>
>
