i have an app that needs to crawl data from another server. the service 
requires login and keeps track of the session via a cookie. i need the app 
to be able to log in as thousands of different users to crawl their data.

the implementation looks like this:

thread 1:
add a crawl job to the queue (threadpool of size N)

job in queue (runs in threadpool):
if not logged in as this user yet, login
fetch data for user
index data for search
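
to make the flow above concrete, here is a minimal java sketch of it. this is 
not the production code: the class name CrawlQueueSketch, the login() and 
fetchAndIndex() stubs, and the string used as a session token are all made up 
for illustration, and it only uses java.util.concurrent, nothing from 
httpclient. the one thing it tries to show is that the login happens at most 
once per user even when several jobs for the same user are queued.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// illustrative sketch only; every name here is hypothetical
final class CrawlQueueSketch {
    // one session token per user, created the first time a job needs it
    private final Map<String, String> sessions = new ConcurrentHashMap<String, String>();
    private final ExecutorService pool;

    // counters exist only so the behaviour can be observed
    final AtomicInteger loginCount = new AtomicInteger();
    final AtomicInteger crawlCount = new AtomicInteger();

    CrawlQueueSketch(int poolSize) {
        this.pool = Executors.newFixedThreadPool(poolSize);
    }

    // "thread 1": add a crawl job to the queue
    void submit(String user) {
        pool.execute(() -> {
            // if not logged in as this user yet, login (at most once per user)
            String session = sessions.computeIfAbsent(user, this::login);
            fetchAndIndex(user, session);
        });
    }

    // stub: the real code would post credentials and keep the session cookie
    private String login(String user) {
        loginCount.incrementAndGet();
        return "session-for-" + user;
    }

    // stub: the real code would fetch the user's records and index them
    private void fetchAndIndex(String user, String session) {
        crawlCount.incrementAndGet();
    }

    // stops accepting jobs; returns true if all queued jobs finished in time
    boolean shutdownAndWait() {
        pool.shutdown();
        try {
            return pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

usage would be something like new CrawlQueueSketch(8), then submit(user) once 
per job, then shutdownAndWait() at the end. one caveat: computeIfAbsent holds 
an internal lock while login() runs, so a slow login can briefly block other 
users that happen to hash to the same bin.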


the dataset is huge. each user has thousands of records, and there are 
thousands of users. i initially had a separate instance of httpclient for 
each user, each with its own MultiThreadedHttpConnectionManager, but as the 
number of users grew, the number of open connections grew too, and they 
stayed open. this eventually caused "too many open files" errors. 

option 1:
a new non-threaded httpclient per job, whose connection i can terminate once 
i'm done, with cookies managed manually to avoid excessive logins. each job in 
the threadpool creates a new instance and shuts it down when it's done. i 
implemented this, but couldn't figure out how to get it to shut down quickly. 
the only thing i found was setSoTimeout, which i set to 2000, but if the app 
runs really fast, that still doesn't prevent the lsof count from getting 
really high. i'm afraid to lower it any further as it might have other side 
effects that i'm not aware of. i also imagine this method creates a lot of 
overhead.
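
the lifecycle i have in mind for option 1 can be modelled like this. it is 
only a sketch with stand-ins: Session is a hypothetical placeholder for the 
per-job client, and its close() is where the real code would shut its 
connection down. it says nothing about how quickly the os actually releases a 
socket after close, which is the part i'm stuck on; it only shows that with 
create-per-job and close-when-done, the number of live sessions should never 
exceed the threadpool size.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// illustrative sketch only; every name here is hypothetical
final class PerJobSessionSketch {
    static final AtomicInteger open = new AtomicInteger();    // sessions currently alive
    static final AtomicInteger maxOpen = new AtomicInteger(); // highest value open ever reached

    // stand-in for a per-job client that owns exactly one connection
    static final class Session implements AutoCloseable {
        Session() {
            int now = open.incrementAndGet();
            maxOpen.accumulateAndGet(now, Math::max);
        }

        @Override
        public void close() {
            // the real code would close its connection here
            open.decrementAndGet();
        }
    }

    // runs the given number of jobs on a fixed pool; true if all finished in time
    static boolean runJobs(int poolSize, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < jobs; i++) {
            pool.execute(() -> {
                // try-with-resources closes the session even if the job throws
                try (Session session = new Session()) {
                    // login if needed, fetch data, index data
                }
            });
        }
        pool.shutdown();
        try {
            return pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```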

option 2:
create a global static MultiThreadedHttpConnectionManager and httpclient 
instance and manually set the cookies on each httpget method that goes out. 
this method seems to have the least overhead, as it can keep reusing the 
connections, and the maximum number of concurrent connections shouldn't really 
go higher than the threadpool size. sounds perfect, but i don't know how to 
make it keep the cookies separate. watching the traffic in wireshark, it seems 
to be mixing up the cookies from multiple users (the one it remembers on its 
own, and the one i add).
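
on my side of option 2, keeping the cookies separate per user would look 
roughly like the sketch below. UserCookieJars and its methods are hypothetical 
names, and this only covers the bookkeeping in my own code (one cookie map per 
user, safe to use from the threadpool). it does not touch whatever the shared 
httpclient remembers on its own, which is the part that is getting mixed in.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// illustrative sketch only; every name here is hypothetical
final class UserCookieJars {
    // user id -> (cookie name -> cookie value)
    private final Map<String, Map<String, String>> jars =
            new ConcurrentHashMap<String, Map<String, String>>();

    // remember a cookie for one user without affecting any other user
    void put(String user, String name, String value) {
        jars.computeIfAbsent(user,
                u -> Collections.synchronizedMap(new LinkedHashMap<String, String>()))
            .put(name, value);
    }

    // snapshot of one user's cookies; empty if the user has not logged in yet
    Map<String, String> cookiesFor(String user) {
        Map<String, String> jar = jars.get(user);
        if (jar == null) {
            return Collections.emptyMap();
        }
        // iterating a synchronizedMap requires holding its lock
        synchronized (jar) {
            return new LinkedHashMap<String, String>(jar);
        }
    }
}
```

each job would call cookiesFor(user) right before building its request, so a 
request only ever carries the cookies stored for that one user.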

here is a code snippet of what i have. i'm using httpclient v3.1 and i'm 
afraid to upgrade it as this system is already in production and i'm just 
trying to fix the "too many open files" issues.

    if (!loginCookies.isEmpty()) {
        String cookieValue = "";
        for (String cookie : loginCookies.keySet()) {
            if (!"".equals(cookieValue))
                cookieValue += "; ";
            cookieValue += cookie + "=" + loginCookies.get(cookie);
        }
        httpget.addRequestHeader("Cookie", cookieValue);
    }

it results in this:

GET /index.aspx?class=Subscriber&proc=List&action=listbyname&format=xml&page=1&page_limit=200 HTTP/1.1
User-Agent: mycustomuseragent
Cookie: 
TASEREVIDENCECOM=DxNZ9PZkP4xPwhwhKPenSSDjrRtyoSr/Jlyn3iBe6B06lQdGkt+aQ/VXgn0A9GcQVyZ2RscFOW486IJiRX9XH1+0B9k1dbKfQhQVYApXS6ATfuKyOS0l2Qv7OoM/DQrb6VIJveLFt+FkbVAV9BkmfW2IWlGu2g89NvBJFNn8OHNpsZXLUBf7ph/qsLyUTwOYENW7y3xGYD9jOoroiqgj/4joNhP5DYKr+hQKCIS3Gzflo2r+Nq1aTqZ2EMHOcqzIseqB+qtFWikp/UMcouM9ZPN/fOuiYGTm1vT3E/Kt1j9peUDvd1ZbUlJ+YxuZANBl9fwFzSEGOwSi3Bt5Ai8WybAHJiTg4oP5lmnkgc/E65CqQa5nqhhO0irS7bR6Jk1Jh+7h3WS3ytNyqUTRUZAnpWZG+lP6Efv0cPYz4Nc8Aupt8l8HxDeV/cHV2JpcL7HkUr/mNcT1lwaUPeBfO7Gc4S+AbpObj5I7y+sFQ1DU67O3Wj30UImB9M8i2RNTTYH9aQtRfR+a4ZbzMQfN2MV0le5/W5/7/QZbkLzUVJu1yzfgaaFfkjVgTbo3qywQgUJhcBJ+DgHOVYjYYZR81wJia/s06odo/mTuhtRA857ctwP2k+37J2Zf0NzWi0+3tEV6rg+2o2HtVDxMBmwWFVqLWrWb1bFC0XxBGVHZVtvsjgqyk7Q8YjqWZl0CtVJUa5Z2xEybWjolII1zFMRnTe7XZ3/qfq1txMJXkGlL0VJy9fVVzcMDZ3ZaYoO/+i5ly6fDm6fGO02/Wpk09gkFc7V45Q==
Host: 172.22.1.55
Cookie: $Version=0; 
TASEREVIDENCECOM=DxNZ9PZkP4xPwhwhKPenSSDjrRtyoSr/Jlyn3iBe6B06lQdGkt+aQ/VXgn0A9GcQVyZ2RscFOW486IJiRX9XH1+0B9k1dbKfQhQVYApXS6ATfuKyOS0l2Qv7OoM/DQrb6VIJveLFt+FkbVAV9BkmfW2IWlGu2g89NvBJFNn8OHNpsZXLUBf7ph/qsLyUTwOYENW7y3xGYD9jOoroiqgj/4joNhP5DYKr+hQKCIS3Gzflo2r+Nq1aTqZ2EMHOcqzIseqB+qtFWikp/UMcouM9ZPN/fOuiYGTm1vT3E/Kt1j9peUDvd1ZbUlJ+YxuZANBl9fwFzSEGOwSi3Bt5Ai8WybAHJiTg4oP5lmnkgc/E65CqQa5nqhhO0irS7bR6Jk1Jh+7h3WS3ytNyqUTRUZAnpWZG+lP6Efv0cPYz4Nc8Aupt8l8HxDeV/cHV2JpcL7HkUr/mNcT1lwaUPeBfO7Gc4S+AbpObj5I7y+sFQ1DU67O3Wj30UImB9M8i2RNTTYH9aQtRfR+a4ZbzMQfN2MV0le5/W5/7/QZbkLzUVJu1yzfgaaFfkjVgTbo3qywQgUJhcBJ+DgHOVYjYYZR81wJia/s06odo/mTuhtRA857ctwP2k+37J2Zf0NzWi0+3tEV6rg+2o2HtVDxMBmwWFVqLWrWb1bFC0XxBGVHZVtvsjgqyk7Q8YjqWZl0CtVJUa5Z2xEybWjolII1zFMRnTe7XZ3/qfq1txMJXkGlL0VJy9fVVzcMDZ3ZaYoO/+i5ly6fDm6fGO02/Wpk09gkFc7V45Q==;
 
$Path=/


i set the first Cookie header; httpclient adds the second one on its own.
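
for reference, the header-building step from my snippet can be pulled out into 
a self-contained method like the one below, which makes it easy to check in 
isolation. CookieHeaderSketch and buildCookieHeader are names i made up for 
this sketch; the logic is meant to be the same as the loop above (name=value 
pairs joined with "; "), just using a StringBuilder and entrySet().

```java
import java.util.Map;

// illustrative sketch only; class and method names are hypothetical
final class CookieHeaderSketch {
    // joins cookies as "name1=value1; name2=value2" in the map's iteration order
    static String buildCookieHeader(Map<String, String> cookies) {
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            if (header.length() > 0) {
                header.append("; ");
            }
            header.append(cookie.getKey()).append('=').append(cookie.getValue());
        }
        return header.toString();
    }
}
```

the result would then be passed to httpget.addRequestHeader("Cookie", ...) 
exactly as before, and only when the map is not empty.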

thanks,
miguel de anda
