Hi Nicolás, if the rules were wrong, I would not obtain any feeds at all.
My rule is this:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'/detalle\.asp\?idb=\d+',)),
         callback='parse_espia', follow=True),
)
I start at www.domain.com and I only want to capture data from URLs of the form
http://www.domain.com/detalle.asp?idb=<number>, for example
http://www.domain.com/detalle.asp?idb=2856
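
For context, the whole spider is set up roughly like this (the class name, the
domain and the empty callback are placeholders here; only the rule is exactly
as above):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class PruebaSpider(CrawlSpider):
    name = 'prueba'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/']

    # follow every link on the site, but only call the callback on detail pages
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/detalle\.asp\?idb=\d+',)),
             callback='parse_espia', follow=True),
    )

    def parse_espia(self, response):
        # item extraction omitted here
        pass
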
On Thursday, December 18, 2014 at 22:26:31 UTC+1, Nicolás Alejandro
Ramírez Quiros wrote:
>
> That limit doesn't exist; the problem is somewhere in your code. You mention
> that you are using Rules; is your regex correct?
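>
> For example, you could check in the scrapy shell which links the extractor
> actually picks up from your start page (the URL and regex here are just the
> ones you posted):
>
> scrapy shell "http://www.domain.com/"
>
> and then, at the shell prompt:
>
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> SgmlLinkExtractor(allow=(r'/detalle\.asp\?idb=\d+',)).extract_links(response)
>
> If that returns an empty list, the regex (or the page) is the problem.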
>
> On Thursday, December 18, 2014 at 06:49:36 UTC-2, ROBERTO ANGUITA MARTIN
> wrote:
>>
>> I should get 600 items, but since the spider closes at 34, I cannot obtain
>> them all. The site being scanned has many sub-links, and I have filtered
>> them by rule with the allowed URL pattern and the allowed domain.
>>
>> Can I obtain more information in some way, to know why it stops? Is there
>> some limit in the item filter?
>>
>> On Thursday, December 18, 2014 at 00:06:30 UTC+1, Travis Leleu wrote:
>>>
>>> What makes you think it's closing prematurely? I see a lot of duplicate
>>> requests filtered out by Scrapy; if you aren't getting as many items as you
>>> expected, that could be why. Check your assumptions.
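>>>
>>> If you want to see exactly which requests are being dropped as duplicates,
>>> you could rerun with the dupefilter debug setting turned on (assuming your
>>> Scrapy version supports it; it just logs every filtered request instead of
>>> only the first one):
>>>
>>> scrapy crawl prueba -s DUPEFILTER_DEBUG=True -s LOG_FILE=salida.out -L DEBUG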
>>>
>>> On Wed, Dec 17, 2014 at 2:31 PM, ROBERTO ANGUITA MARTIN <
>>> [email protected]> wrote:
>>>>
>>>> I am trying my first crawl.
>>>> I launch my scraper with this command:
>>>>
>>>> nohup scrapy crawl prueba -o prueba.csv -t csv -s LOG_FILE=salida.out
>>>> -s JOBDIR=work -L DEBUG &
>>>>
>>>>
>>>> and I have configured CsvExportPipeline.py like the example in the manual,
>>>> but the spider finishes after scraping only 34 items.
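>>>>
>>>> (For reference, the pipeline is essentially the item exporter example from
>>>> the docs adapted to CSV; simplified here, the real field handling lives in
>>>> my items module:)
>>>>
>>>> from scrapy import signals
>>>> from scrapy.contrib.exporter import CsvItemExporter
>>>>
>>>> class CsvExportPipeline(object):
>>>>
>>>>     @classmethod
>>>>     def from_crawler(cls, crawler):
>>>>         # hook the exporter to the spider_opened / spider_closed signals
>>>>         pipeline = cls()
>>>>         crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
>>>>         crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
>>>>         return pipeline
>>>>
>>>>     def spider_opened(self, spider):
>>>>         self.file = open('%s_items.csv' % spider.name, 'w+b')
>>>>         self.exporter = CsvItemExporter(self.file)
>>>>         self.exporter.start_exporting()
>>>>
>>>>     def spider_closed(self, spider):
>>>>         self.exporter.finish_exporting()
>>>>         self.file.close()
>>>>
>>>>     def process_item(self, item, spider):
>>>>         self.exporter.export_item(item)
>>>>         return item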
>>>>
>>>> Why? I have been searching the internet and everybody says it is a memory
>>>> problem, but I don't find anything about memory in the log.
>>>>
>>>> The log level is DEBUG, but I cannot see the reason why it only read 34
>>>> items.
>>>>
>>>>
>>>> The end of the log is this:
>>>>
>>>>
>>>>
>>>> 2014-12-17 17:02:32+0100 [prueba] INFO: Closing spider (finished)
>>>> 2014-12-17 17:02:32+0100 [prueba] INFO: Stored csv feed (34 items) in: prueba.csv
>>>> 2014-12-17 17:02:32+0100 [prueba] INFO: Dumping Scrapy stats:
>>>> {'downloader/request_bytes': 14603,
>>>>  'downloader/request_count': 35,
>>>>  'downloader/request_method_count/GET': 35,
>>>>  'downloader/response_bytes': 551613,
>>>>  'downloader/response_count': 35,
>>>>  'downloader/response_status_count/200': 35,
>>>>  'dupefilter/filtered': 363,
>>>>  'finish_reason': 'finished',
>>>>  'finish_time': datetime.datetime(2014, 12, 17, 16, 2, 32, 392134),
>>>>  'item_scraped_count': 34,
>>>>  'log_count/DEBUG': 72,
>>>>  'log_count/ERROR': 1,
>>>>  'log_count/INFO': 48,
>>>>  'request_depth_max': 5,
>>>>  'response_received_count': 35,
>>>>  'scheduler/dequeued': 35,
>>>>  'scheduler/dequeued/disk': 35,
>>>>  'scheduler/enqueued': 35,
>>>>  'scheduler/enqueued/disk': 35,
>>>>  'start_time': datetime.datetime(2014, 12, 17, 15, 21, 55, 218630)}
>>>> 2014-12-17 17:02:32+0100 [bodegas] INFO: Spider closed (finished)
>>>>
>>>>
>>>> Can anybody help me?
>>>>
>>>>
>>>> Regards
>>>>
>>>> Roberto
>>>>
>>>