On Tue, 2010-05-11 at 08:20 -0700, Miguel De Anda wrote:
> i have an app that needs to crawl for data from another server. the service
> requires login, and it keeps track of the session via a cookie. i need the app
> to be able to log in as thousands of different users to crawl for data.
>
> the implementation is like this:
>
> thread 1:
>   add a crawl job to the queue (threadpool of size N)
>
> job in queue (runs in threadpool):
>   if not logged in as this user yet, login
>   fetch data for user
>   index data for search
>
>
> the dataset is huge. each user has thousands of records, and there are
> thousands of users. i initially had a separate instance of httpclient with
> its own MultiThreadedHttpConnectionManager for each user, but as the number of
> users grew, the number of open connections grew, and they stayed open. this
> eventually caused "too many open files" errors.
>
> option 1:
> a new non-threaded httpclient that i can terminate the connection to once i'm
> done, manually manage cookies to prevent excessive logins. each job in the
> threadpool will create a new instance and shut it down when it's done. i
> implemented this, but couldn't figure out how to get it to shut down fast. the
> only thing i found was setSoTimeout and i set it to 2000 but if the app goes
> really fast, it still doesn't prevent the lsof count from getting really high.
> i'm afraid to lower it even more as it might have other side effects that i'm
> not aware of. i also imagine this method creates lots of overhead.
>
> option 2:
> create a global static MultiThreadedHttpConnectionManager and httpclient
> instance and manually set the cookies to each httpget method that goes out.
> this method seems to have the least overhead as it can continue to reuse the
> connections and the maximum number of concurrent connections shouldn't really
> go higher than the threadpool size. sounds perfect but i don't know how to
> have it keep cookies separate. while using wireshark it seems that it's mixing
> up the cookies from multiple users (the one it remembers on its own, and the
> one i add).
>
> here is a code snippet of what i have. i'm using httpclient v3.1 and i'm
> afraid to upgrade it as this system is already in production and i'm just
> trying to fix the "too many open files" issues.
>
> if (!loginCookies.isEmpty()) {
>     String cookieValue = "";
>     for (String cookie : loginCookies.keySet()) {
>         if (!"".equals(cookieValue))
>             cookieValue += "; ";
>         cookieValue += cookie + "=" + loginCookies.get(cookie);
>     }
>     httpget.addRequestHeader("Cookie", cookieValue);
> }
>
> it results in this:
>
> GET /index.aspx?class=Subscriber&proc=List&action=listbyname&format=xml&page=1&page_limit=200 HTTP/1.1
> User-Agent: mycustomuseragent
> Cookie:
> TASEREVIDENCECOM=DxNZ9PZkP4xPwhwhKPenSSDjrRtyoSr/Jlyn3iBe6B06lQdGkt+aQ/VXgn0A9GcQVyZ2RscFOW486IJiRX9XH1+0B9k1dbKfQhQVYApXS6ATfuKyOS0l2Qv7OoM/DQrb6VIJveLFt+FkbVAV9BkmfW2IWlGu2g89NvBJFNn8OHNpsZXLUBf7ph/qsLyUTwOYENW7y3xGYD9jOoroiqgj/4joNhP5DYKr+hQKCIS3Gzflo2r+Nq1aTqZ2EMHOcqzIseqB+qtFWikp/UMcouM9ZPN/fOuiYGTm1vT3E/Kt1j9peUDvd1ZbUlJ+YxuZANBl9fwFzSEGOwSi3Bt5Ai8WybAHJiTg4oP5lmnkgc/E65CqQa5nqhhO0irS7bR6Jk1Jh+7h3WS3ytNyqUTRUZAnpWZG+lP6Efv0cPYz4Nc8Aupt8l8HxDeV/cHV2JpcL7HkUr/mNcT1lwaUPeBfO7Gc4S+AbpObj5I7y+sFQ1DU67O3Wj30UImB9M8i2RNTTYH9aQtRfR+a4ZbzMQfN2MV0le5/W5/7/QZbkLzUVJu1yzfgaaFfkjVgTbo3qywQgUJhcBJ+DgHOVYjYYZR81wJia/s06odo/mTuhtRA857ctwP2k+37J2Zf0NzWi0+3tEV6rg+2o2HtVDxMBmwWFVqLWrWb1bFC0XxBGVHZVtvsjgqyk7Q8YjqWZl0CtVJUa5Z2xEybWjolII1zFMRnTe7XZ3/qfq1txMJXkGlL0VJy9fVVzcMDZ3ZaYoO/+i5ly6fDm6fGO02/Wpk09gkFc7V45Q==
> Host: 172.22.1.55
> Cookie: $Version=0;
> TASEREVIDENCECOM=DxNZ9PZkP4xPwhwhKPenSSDjrRtyoSr/Jlyn3iBe6B06lQdGkt+aQ/VXgn0A9GcQVyZ2RscFOW486IJiRX9XH1+0B9k1dbKfQhQVYApXS6ATfuKyOS0l2Qv7OoM/DQrb6VIJveLFt+FkbVAV9BkmfW2IWlGu2g89NvBJFNn8OHNpsZXLUBf7ph/qsLyUTwOYENW7y3xGYD9jOoroiqgj/4joNhP5DYKr+hQKCIS3Gzflo2r+Nq1aTqZ2EMHOcqzIseqB+qtFWikp/UMcouM9ZPN/fOuiYGTm1vT3E/Kt1j9peUDvd1ZbUlJ+YxuZANBl9fwFzSEGOwSi3Bt5Ai8WybAHJiTg4oP5lmnkgc/E65CqQa5nqhhO0irS7bR6Jk1Jh+7h3WS3ytNyqUTRUZAnpWZG+lP6Efv0cPYz4Nc8Aupt8l8HxDeV/cHV2JpcL7HkUr/mNcT1lwaUPeBfO7Gc4S+AbpObj5I7y+sFQ1DU67O3Wj30UImB9M8i2RNTTYH9aQtRfR+a4ZbzMQfN2MV0le5/W5/7/QZbkLzUVJu1yzfgaaFfkjVgTbo3qywQgUJhcBJ+DgHOVYjYYZR81wJia/s06odo/mTuhtRA857ctwP2k+37J2Zf0NzWi0+3tEV6rg+2o2HtVDxMBmwWFVqLWrWb1bFC0XxBGVHZVtvsjgqyk7Q8YjqWZl0CtVJUa5Z2xEybWjolII1zFMRnTe7XZ3/qfq1txMJXkGlL0VJy9fVVzcMDZ3ZaYoO/+i5ly6fDm6fGO02/Wpk09gkFc7V45Q==;
> $Path=/
>
> i set the first Cookie, the second is set by httpclient.
>
> thanks,
> miguel de anda
>
Miguel,

You should be using a separate HttpState per individual user and let
HttpClient manage cookies.

Oleg

PS: 3.1 is effectively end of life. It will become more and more
difficult to get any help for it on this list.
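
One way to read the advice above (a separate HttpState per user) is a small
registry that lazily creates one state object per user ID and always hands
back the same one. The sketch below is only an illustration: the class name
PerUserStates and its methods are made up, and the type parameter S stands in
for org.apache.commons.httpclient.HttpState so that the snippet compiles
without the HttpClient jar. It uses Java 8 lambdas.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/**
 * Hypothetical helper: keeps one state object per user ID.
 * S is a placeholder for org.apache.commons.httpclient.HttpState.
 */
final class PerUserStates<S> {
    private final ConcurrentMap<String, S> states = new ConcurrentHashMap<>();
    private final Supplier<S> factory;

    PerUserStates(Supplier<S> factory) {
        this.factory = factory;
    }

    /** Returns this user's state, creating it on first use (safe across pool threads). */
    S forUser(String userId) {
        return states.computeIfAbsent(userId, id -> factory.get());
    }

    /** Drops a user's state, e.g. when the session has expired and a new login is needed. */
    void forget(String userId) {
        states.remove(userId);
    }

    /** Number of users that currently have a state. */
    int size() {
        return states.size();
    }
}
```

With HttpClient 3.1 on the classpath this would presumably be created as
new PerUserStates<HttpState>(HttpState::new). Each job would then pass
states.forUser(userId) as the third argument of
HttpClient.executeMethod(HostConfiguration, HttpMethod, HttpState) on the one
shared client, instead of adding a Cookie header by hand, so the shared
connection pool stays in place while cookies are kept per user.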