No length of FFT will distinguish between a mixture of these sine waves and a single amplitude-modulated one, because they're mathematically identical! Specifically:

sin(2π*440t) + sin(2π*441t) = 2*cos(2π*0.5t)*sin(2π*440.5t)
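
If you want to convince yourself numerically, here is a minimal sketch that evaluates both sides of the identity sample by sample and reports the largest difference. It treats the numbers as frequencies in Hz (hence the 2π factors), and the function names, duration and sample rate are arbitrary choices made for illustration only.

```python
import math

def two_tone(f1, f2, t):
    """Sum of two unit-amplitude sines at f1 and f2 (Hz), at time t (s)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def modulated_tone(f1, f2, t):
    """One sine at the mean frequency, scaled by a slow cosine envelope."""
    envelope = 2 * math.cos(2 * math.pi * (f2 - f1) / 2 * t)
    carrier = math.sin(2 * math.pi * (f1 + f2) / 2 * t)
    return envelope * carrier

def max_difference(f1, f2, duration=2.0, sample_rate=8000):
    """Largest absolute difference between the two forms over the sampled span."""
    n = int(duration * sample_rate)
    return max(
        abs(two_tone(f1, f2, k / sample_rate) - modulated_tone(f1, f2, k / sample_rate))
        for k in range(n)
    )

if __name__ == "__main__":
    # Both pairs should differ only by floating-point rounding error.
    print(max_difference(440, 441))
    print(max_difference(440, 550))
```

The difference stays at rounding-error level for the wide pair as well as the close one, since the identity holds for any two frequencies.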

So the question isn't whether an algorithm can distinguish between them, but which of the two interpretations it should pick. I would say that in most audio applications the best answer is the interpretation the human hearing system would pick. In this example that's clearly the right-hand side: a single tone whose loudness slowly rises and falls (beating). With a large separation (e.g. 440Hz and 550Hz, a major third) it's clearly the left-hand side: two distinct pitches. Somewhere in between it must be a toss-up.

I guess you could model both simultaneously, with some kind of probability weighting.

