[DotNetDevelopment] multithreaded crawler

cengineer Thu, 04 Feb 2010 22:13:49 -0800

I have tried to implement a multithreaded crawler and I am having
quite a few issues with it. As a newbie, I am finding multithreading
quite a bit harder than I thought it would be.
Firstly, this is how I start the threads:
[code/]
startwatch.Start()
'myQ.Enqueue(thread)
For Each link In completeList
    Try
        Dim thread = New Thread(AddressOf processUrl)
        numThread = numThread + 1
        ' Only threads created while numThread < 10 are tracked for Join,
        ' but every thread is started.
        If numThread < 10 Then
            ThreadList.Add(thread)
        End If
        thread.Start(link)
    Catch ex As Exception
        MessageBox.Show(ex.Message.ToString, "Error Message", MessageBoxButtons.OK)
    End Try
Next

' Wait for the tracked threads to finish
For Each thread In ThreadList
    thread.Join()
Next

startwatch.Stop()
elapsedTime = startwatch.ElapsedMilliseconds
[/code]

The idea is to read a list of URLs and then have a few threads go and
fetch the pages. Once I get the pages, I parse the HTML to extract
more URLs and then write these to the database. My problem is in the
part where I parse the pages and write to the database. I use
SyncLock but I still get errors saying it cannot access either the
file or the database. At one point the program just crashes. Here is a
peek at the calling code:

[code/]
If Not String.IsNullOrEmpty(html) Then
    'get all links first
    SyncLock html
        links = parser.GetLinks(fromUrl, html)
    End SyncLock

    For Each link As String In links
        ...
        ...
        ...

        Links_DBObj.insert_feedurls_link(link, feedlink, execError, _
                                         connObj_Generic, commObj_Generic)
[/code]

Does anyone have any suggestions? Others have mentioned using
synchronized queues and the like, but I am not too sure how to do that.
What would be the most efficient way to implement this? Should the
threads just fetch the URLs, with the pages then parsed individually,
or can I have the threads do the parsing too?
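
For reference, the shared-queue approach mentioned above might look roughly like the sketch below. It assumes .NET 3.5 or earlier (so no System.Collections.Concurrent): a fixed number of worker threads pull URLs from one Queue(Of String) guarded by SyncLock on a dedicated lock object, do the slow work outside any lock, and take a second lock only to record results. CrawlerSketch, CrawlAll and ProcessUrl are hypothetical names, and ProcessUrl is only a stand-in for the real fetch, parse and insert step.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Threading

Module CrawlerSketch
    ' Shared work queue; every access goes through queueLock.
    Private ReadOnly pending As New Queue(Of String)()
    Private ReadOnly queueLock As New Object()

    ' Results collected by the workers; every access goes through resultLock.
    Private ReadOnly results As New List(Of String)()
    Private ReadOnly resultLock As New Object()

    ' Hypothetical stand-in for fetching a page, parsing it and writing links.
    Private Function ProcessUrl(ByVal url As String) As String
        Return "processed " & url
    End Function

    ' Each worker loops until the queue is empty, then exits.
    Private Sub Worker()
        While True
            Dim url As String = Nothing
            SyncLock queueLock
                If pending.Count = 0 Then Return
                url = pending.Dequeue()
            End SyncLock

            ' The slow work happens outside the lock so workers run in parallel.
            Dim outcome As String = ProcessUrl(url)

            SyncLock resultLock
                results.Add(outcome)
            End SyncLock
        End While
    End Sub

    ' Queues all URLs, runs workerCount threads, waits for all of them,
    ' and returns a copy of the collected results.
    Public Function CrawlAll(ByVal urls As IEnumerable(Of String), _
                             ByVal workerCount As Integer) As List(Of String)
        SyncLock resultLock
            results.Clear()
        End SyncLock
        SyncLock queueLock
            For Each u As String In urls
                pending.Enqueue(u)
            Next
        End SyncLock

        Dim workers As New List(Of Thread)()
        For i As Integer = 1 To workerCount
            Dim worker As New Thread(AddressOf Worker)
            workers.Add(worker)   ' track every thread that is started
            worker.Start()
        Next
        For Each started As Thread In workers
            started.Join()
        Next

        Return New List(Of String)(results)
    End Function

    Sub Main()
        Dim urls() As String = {"http://example.com/a", _
                                "http://example.com/b", _
                                "http://example.com/c"}
        Dim done As List(Of String) = CrawlAll(urls, 2)
        Console.WriteLine(done.Count)
    End Sub
End Module
```

The point of the design is that the locks are held on objects every thread shares and only for the few lines that touch shared state. A database connection or file handle shared between workers would need the same treatment, or each worker could open its own.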
